Seeking tutorials and information on load-balancing between threads

I know the term "Load Balancing" can be very broad, but the subject I'm trying to explain is more specific, and I don't know the proper terminology. What I'm building is a set of Server/Client applications. The server needs to be able to handle a massive amount of data transfer, as well as client connections, so I started looking into multi-threading.
There are essentially three ways I can see of implementing threading for the server...
One thread handling all requests (defeats the purpose of threading if 500 clients are logged in)
One thread per user (risky, since 500 clients would mean 500 threads)
A pool of threads which divides the work evenly across any number of clients (what I'm seeking)
The third option is the one I'd like to know about. It consists of a setup like this (a sketch of the pattern follows the list):
A maximum of 250 threads running at once
500 clients will not create 500 threads, but will share the 250
A queue of pending requests waiting to be handed to a thread
A thread is not tied down to a client, and vice versa
The server decides which thread gets a request based on activity (load balancing)
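(For concreteness, the usual names for this setup are a "thread pool" fed by a "producer-consumer" work queue. Below is a minimal sketch of the pattern, written in C++ here since the idea is language-agnostic; the class and member names are illustrative only, not any specific library's API:)

    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Fixed-size thread pool: N workers share one request queue, so 500
    // clients never create more than N threads, and no thread is tied
    // to any particular client.
    class ThreadPool {
    public:
        explicit ThreadPool(std::size_t workers) {
            for (std::size_t i = 0; i < workers; ++i)
                threads_.emplace_back([this] { workerLoop(); });
        }
        ~ThreadPool() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto& t : threads_) t.join();
        }
        // Called by the networking layer whenever a client request arrives.
        void submit(std::function<void()> request) {
            { std::lock_guard<std::mutex> lk(m_); queue_.push(std::move(request)); }
            cv_.notify_one();
        }
    private:
        void workerLoop() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                    if (done_ && queue_.empty()) return;
                    job = std::move(queue_.front());
                    queue_.pop();
                }
                job();  // an idle worker simply picks up the next pending request
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> queue_;
        std::vector<std::thread> threads_;
        bool done_ = false;
    };

(Note that the "load balancing" here is implicit: rather than the server choosing a thread, idle workers pull the next request themselves, which evens out the load automatically.)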
I'm not seeking code quite yet, but rather information on how a setup like this works, and preferably a tutorial for accomplishing it in Delphi (XE2). Even a proper word or name for this subject would be sufficient so I can do the searching myself.
EDIT
I found it necessary to explain a little about what this will be used for. I will be streaming both commands and images; there will be a double-socket setup with one "Main Command Socket" and another "Add-on Image Streaming Socket". So one logical connection is really two socket connections.
Each connection to the server's main socket creates (or re-uses) an object representing all the data needed for that connection, including threads, images, settings, etc. For every connection to the main socket, a streaming socket is also connected. It's not always streaming images, but the command socket is always ready.
The point is that I already have a threading mechanism in my current setup (one thread per session object) and I'd like to shift it over to a pool-like multithreading environment. The two connections together require higher-level control over these threads, and I can't rely on something like Indy to keep them synchronized; I'd rather know how things are working than learn to trust something else to do the work for me.

IOCP server. It's the only high-performance solution. It's essentially asynchronous in user mode ('overlapped I/O' in M$-speak): a pool of threads issues WSARecv, WSASend, and AcceptEx calls and then all wait on an IOCP queue for completion records. When something useful happens, a kernel thread pool performs the actual I/O and then queues up the completion records.
You need at least a buffer class and a socket class (and probably others for high performance: objectPool and pooledObject classes so you can make socket and buffer pools).
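(A skeletal illustration of that worker loop, Windows-only, with error handling omitted; the IoContext struct is an assumption of this sketch, not a fixed API:)

    // IOCP pattern: a pool of threads blocks on GetQueuedCompletionStatus()
    // and handles whatever completion arrives.
    #include <winsock2.h>
    #include <windows.h>

    // Each outstanding WSARecv/WSASend/AcceptEx carries one of these; the
    // OVERLAPPED must be the first member so it can be recovered from the queue.
    struct IoContext {
        OVERLAPPED overlapped;   // required by all overlapped calls
        WSABUF     buffer;       // points at the data buffer for this operation
        int        operation;    // app-defined: recv, send, accept, ...
    };

    DWORD WINAPI WorkerThread(LPVOID param) {
        HANDLE iocp = static_cast<HANDLE>(param);
        for (;;) {
            DWORD bytes = 0;
            ULONG_PTR key = 0;          // per-socket context, set at association time
            OVERLAPPED* ov = nullptr;
            if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
                continue;               // real code: inspect GetLastError()
            IoContext* ctx = reinterpret_cast<IoContext*>(ov);
            // Dispatch on ctx->operation: process received data, issue the
            // next WSARecv/WSASend, return buffers to the pool, etc.
            (void)bytes; (void)key; (void)ctx;
        }
        return 0;
    }

    HANDLE StartIocp(int workers) {
        HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);
        for (int i = 0; i < workers; ++i)
            CreateThread(nullptr, 0, WorkerThread, iocp, 0, nullptr);
        // Sockets are later associated with 'iocp' via CreateIoCompletionPort()
        // before issuing overlapped WSARecv/WSASend/AcceptEx calls on them.
        return iocp;
    }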

500 threads may not be an issue on a server-class computer. A blocking TCP thread doesn't do much while it's waiting for data to arrive.
There's nothing stopping you from creating some type of work queue on the server side, served by a limited size pool of threads. A simple thread-safe TList works great as a queue, and you can easily put a message handler on each server thread for notifications.
Still, at some point you may have too much work, or too many threads, for the server to handle. This is usually handled by adding another application server.
To ensure scalability, code for the idea of multiple servers, and you can keep scaling by adding hardware.
There may be some reason to limit the number of actual work threads, such as limiting lock contention on a database or something similar; in general, though, you distribute work by adding threads and let the hardware (CPU, redirector, switch, NAS, etc.) schedule the load.

Your implementation is completely tied to the communications components you use. If you use Indy, or anything based on Indy, it is one thread per connection, period! There is no way to change this. Indy will scale to hundreds of connections, but not thousands. Your best hope for using thread pools with your communications components is IOCP, but here your choices are limited by the lack of third-party components. I did all this investigation before, and you can see my question at stackoverflow.com/questions/7150093/scalable-delphi-tcp-server-implementation.
I have a fully working distributed development framework (threading and comms) that has been used in production for over 3 years now across more than a half-dozen separate systems and basically covers everything you have asked so far. The code can be found on the web as well.

Related

Thread in an event-driven vs non-event driven web server

The following two diagrams are my understanding of how threads work in an event-driven web server (like Node.js + JavaScript) compared to a non-event-driven web server (like IIS + C#).
From the diagrams it is easy to tell that on a traditional web server the number of threads used to perform 3 long-running operations is larger than on an event-driven web server (3 vs. 1).
I think I got the "traditional web server" counts correct (3) but I wonder about the event-driven one (1). Here are my questions:
1. Is it correct to assume that only one thread was used in the event-driven scenario? That can't be correct; something must have been created to handle the I/O tasks. Right?
2. How did the evented server handle the I/O? Let's say the I/O was to read from a database. I suspect the web server had to create a thread to hand off the job of connecting to the database. Right?
3. If the event-driven web server indeed created threads to handle the I/O, where is the gain?
4. A possible explanation for my confusion could be that in both scenarios, traditional and event-driven, three separate threads were indeed created to handle the I/O (not shown in the pictures), but the difference is really in the number of threads on the web server per se, not in the I/O threads. Is that accurate?
1. Node may use threads for IO. The JS code runs in a single thread, but all the IO requests run in parallel threads. If you want some JS code to run in parallel threads, use threads-a-gogo or some other package that enables that behaviour.
2. Same as 1: threads are created by Node for IO operations.
3. You don't have to handle threading unless you want to. It's easier to develop. At least that's my point of view.
4. A Node application can be coded to run like another web server. Typically, JS code runs in a single thread, but there are ways to make it behave differently (the single-threaded multiplexing idea is sketched below).
Personally, I recommend threads-a-gogo (the package name isn't that revealing, but it is easy to use) if you want to experiment with threads. It's faster.
Node also supports multiple processes; you may run a completely separate process if you also want to try that out.
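(As promised above, here is the core "one thread, many sockets" idea behind the evented diagram, sketched as a plain select() loop in C++/POSIX. Node's actual machinery, libuv, is far more elaborate, but the shape is the same:)

    // Single-threaded event loop: one thread watches all sockets and only
    // touches the ones that are ready, so no thread-per-client is needed.
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <vector>

    void eventLoop(int listenFd, std::vector<int>& clients) {
        for (;;) {
            fd_set readable;
            FD_ZERO(&readable);
            FD_SET(listenFd, &readable);
            int maxFd = listenFd;
            for (int fd : clients) {        // register every client socket
                FD_SET(fd, &readable);
                if (fd > maxFd) maxFd = fd;
            }
            // Blocks until at least one socket has data: no busy-waiting.
            select(maxFd + 1, &readable, nullptr, nullptr, nullptr);
            if (FD_ISSET(listenFd, &readable))
                clients.push_back(accept(listenFd, nullptr, nullptr));
            for (int fd : clients)
                if (FD_ISSET(fd, &readable)) {
                    char buf[4096];
                    ssize_t n = read(fd, buf, sizeof buf);  // ready: won't block
                    // handle n bytes of request data here
                    (void)n;
                }
        }
    }

(One reason real event loops use epoll()/kqueue() instead: select() tops out at FD_SETSIZE descriptors and rescans the whole set on every pass.)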
The best way to picture NodeJS is like a furious squirrel (i.e. your thread) running in a wheel with an infinite number of pigeons (your I/O) available to pass messages around.
I/O in node is "free". Your squirrel works to set up the connection and send the pigeon off, then can go on to do other things while the pigeon retrieves the data, only dealing with the data when the pigeon returns.
If you write bad code, you can end up having the squirrel waiting for each pigeon.
So always write non-blocking i/o code.
If you can encourage your Pigeons to promise to come back ;)
Promises and generators are probably the best approach you can take to this.
HOWEVER, you can always use Node cluster to establish a master squirrel that will procreate child squirrels, based on the number of CPUs the master squirrel can find, to dole out the work (that pre-fork pattern is sketched below).
Hope this helps and note the complete lack of a car analogy.
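(For the cluster point: Node's cluster module is a classic pre-fork master/worker layout. A rough C++/POSIX sketch of that shape, purely illustrative; runWorkerEventLoop() is a hypothetical placeholder for each child's server loop:)

    // Pre-fork master/worker pattern: one master process forks one worker
    // per CPU; each worker then runs its own accept/event loop.
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <thread>
    #include <unistd.h>

    extern void runWorkerEventLoop();  // hypothetical: each child's server loop

    int main() {
        unsigned cpus = std::thread::hardware_concurrency();
        for (unsigned i = 0; i < cpus; ++i) {
            pid_t pid = fork();
            if (pid == 0) {            // child: becomes a worker
                runWorkerEventLoop();  // never returns
                _exit(0);
            }
        }
        // Master: reap workers (a real master would also respawn crashed ones,
        // which is what cluster's 'exit' event lets you do in Node).
        while (wait(nullptr) > 0) {}
        return 0;
    }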

Multithreaded Corba Client

There is a lot on multithreading on the Corba server side, but I'm interested about the client side. We have a multithreaded client (Solaris, Orbix 6.3) with a Corba singleton "manager" that initialises the ORB. During runtime 'lsof' shows only one TCP connection to the Corba server, so all synchronous calls made from the client worker threads should be serialised.
We would like to change this arrangement to take advantage of parallelism: each thread managing its own connection. I've changed the setup so that, instead of a singleton, each worker thread calls ORB_init(), etc.
Totally puzzled now: 'lsof' now shows 2 TCP connections, but there are 6 worker threads.
Something is not right; I would have expected as many TCP connections as there are worker threads. Maybe the approach is naive: does it make sense, for example, to call ORB_init() per thread?
I'd need someone's opinion on this. Sample code for a multithreaded client would greatly help. Again, using Orbix 6.3 on Solaris.
Kind regards,
Adrian
The management of connections is implementation-specific for plain CORBA. Each vendor has its own proprietary way of configuring this behavior. The RTCORBA specification has a standardized way to configure how connections between client and server will be used.
I don't know how Orbix works or whether it supports RTCORBA; that is something you could probably get from their manuals. I do know that TAO has a lot of support for threading on the client side. By default, when multiple threads make an invocation to the same server, multiple TCP/IP transports can be opened at the same moment.
Thank you guys for your answers. I found, as Johnny says that this is indeed implementation specific.
omniORB, for example, has maxGIOPConnectionPerServer (default 5). That's:
The maximum number of concurrent connections the ORB will open to a single server. If multiple threads on the client call the same server, the ORB opens additional connections to the server, up to the maximum specified by this parameter. If the maximum is reached, threads are blocked until a connection becomes free for them to use.
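(For what it's worth, omniORB parameters like this can be supplied at ORB_init() time using its -ORB<name> argument convention; a small hedged sketch, where the value 8 is arbitrary:)

    // omniORB sketch: raise the per-server connection cap at ORB_init() time.
    // The same parameter can also be set in omniORB.cfg or the environment.
    #include <omniORB4/CORBA.h>

    int main(int argc, char** argv) {
        (void)argc;
        const char* args[] = {
            argv[0],
            "-ORBmaxGIOPConnectionPerServer", "8",  // default is 5
        };
        int orbArgc = 3;
        CORBA::ORB_var orb =
            CORBA::ORB_init(orbArgc, const_cast<char**>(args));
        // ... resolve references; concurrent calls from worker threads may
        // now open up to 8 connections to the same server.
        orb->destroy();
        return 0;
    }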
Unfortunately I haven't yet found out what's the equivalent (if any) for Orbix. It's definitely defaulting to 1 connection. Still googling...
I found out, though, that as part of a Solaris -> Linux migration we will be moving from Orbix to TAO in a number of months. I'm hoping TAO will be more friendly and customizable.
Orbix internally uses a lot of optimization routines to ensure that connections are used efficiently. Specifically, it's not going to open up multiple connections to the same server endpoint because it's able to multiplex multiple concurrent GIOP requests over the same TCP connection. CORBA deliberately hides connection management from client and server programmers.
I don't believe this is controllable through configuration. Send a support ticket to Progress Support to confirm. You might be able to force it to happen if you move away from the singleton model and initialize a different ORB for each client (each with its own unique ID), but that would be a very heavy-handed and costly solution to a problem that is a little vague. The underlying ORB is already built to optimize for concurrent requests, so I'm not sure what problem it is you're trying to solve.
In my honest opinion, I don't think there is such a concept as a multi-threaded client for CORBA applications. On the server side, there is only one object registered with the naming service, available to all clients. If you look at the IOR of the object, it will be the same for all clients. So a client can establish at most one connection to that object. It also means you cannot get more than one remote object (however many times you look up the object from different clients, they all get the same reference) for any number of clients. So, in order to support multi-threading, the server actually has to support different thread policies. With the POA, the server can have different thread policies. Please go through JAVA PROGRAMMING WITH CORBA for more.
I don't know how exactly Orbix works, but normally ORB initialization is done only once, even for a multithreaded setup. The multithreaded (server-side) ORB will start a number of worker threads (on demand, or a fixed number if so configured) to handle incoming connections. Each connection is handled by a worker, which looks up the servant that can handle the request. Normally this (the real call to the servant) is also performed in an extra thread. But you won't see this thread with lsof. Try using ps -eLf or top -H with thread support enabled.
EDIT:
On the client side, it depends on how many objects you want to call. For each object, a caller thread is possible. It is also possible to have more than one caller thread per remote object, but only if the calls come from different threads in the client-side logic. (Imagine having multiple threads, with the remote object shared across the threads.)

New thread per client connection in socket server?

I am trying to optimize handling multiple simultaneous connections to a TCP socket server.
Is it considered good practice, or even rational, to initiate a new thread in the listening server every time I receive a connection request?
At what point should I begin to worry about a server based on this infrastructure? What is the maximum number of background threads I can use before it doesn't make sense anymore?
The platform is C#, the framework is Mono, the target OS is CentOS, RAM is 2.4 GB, the server is in the cloud, and I'm expecting about 200 connection requests per second.
No, you shouldn't have one thread per connection. Instead, you should be using the asynchronous methods (BeginAccept/EndAccept, BeginSend/EndSend, etc). These will make much more efficient use of system resources.
In particular, every thread you create adds overhead in terms of context switches, stack space, cache misses, and so on. Linux is better at managing this stuff than Windows, for example, but that shouldn't be an excuse to give you free rein to create as many threads as you like ;)

Number of threads in a middleware application

I am writing an application server (again, unrelated to a question I already posted here) and I am wondering what strategies to use when creating worker threads that work on the database. Some preliminary data: the server receives XML and sends back XML, all requests query a database, and each request could take from a few milliseconds to a few seconds.
Say, for example, that your server services a small to medium number of clients which in turn send a small number of requests per connection. Is it safe to have one worker thread per connection, or should it be per request? Also, should a thread pool be used to limit the resources used by the server, or should a worker be added each time a new connection/request is made?
Should the server limit the number of threads it creates to an upper limit?
Hope I am not too vague ... I can hardly keep my eyes open.
If you don't have extensive experience, writing application servers is a daunting task. It can be eased by using frameworks like ACE that allow you to build different configurations of your app-serving infrastructure (thread per connection, thread pools, leader/follower) and then load the appropriate configuration with an extensible service framework.
I would recommend reading these books on ACE to get an idea of what the framework can do for you:
C++ Network Programming: Mastering Complexity Using ACE and Patterns
C++ Network Programming: Systematic Reuse with ACE and Frameworks
The way I write apps like this is to make the number of threads configurable via the command line and/or a configuration file. I then do some load testing with different numbers of threads - there is always an optimal number beyond which performance begins to degrade.
If you follow the model adopted by Java EE app server developers, there's a queue for incoming requests and a pool of worker threads to service them. It's one thread per request. When a worker thread fulfills a request it goes back into the pool. If the incoming requests show up faster than the worker thread pool can service them, the queue allows them to stack up until a worker thread is released. Both the queue size and the thread pool can be tuned to match for your situation.
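(A minimal sketch of the queue half of that model in C++; the bounded capacity is what lets requests "stack up" safely while applying backpressure to whatever is accepting them. The capacity value is arbitrary here:)

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>

    // Bounded blocking queue: producers (the connection acceptor) block when
    // it is full, which is the backpressure that protects the worker pool.
    template <typename Request>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

        void push(Request r) {                    // called by the acceptor
            std::unique_lock<std::mutex> lk(m_);
            notFull_.wait(lk, [this] { return q_.size() < capacity_; });
            q_.push(std::move(r));
            notEmpty_.notify_one();
        }
        Request pop() {                           // called by worker threads
            std::unique_lock<std::mutex> lk(m_);
            notEmpty_.wait(lk, [this] { return !q_.empty(); });
            Request r = std::move(q_.front());
            q_.pop();
            notFull_.notify_one();
            return r;
        }
    private:
        std::mutex m_;
        std::condition_variable notFull_, notEmpty_;
        std::queue<Request> q_;
        const std::size_t capacity_;
    };

(Workers simply call pop() in a loop; tuning then comes down to the queue capacity and the number of workers, exactly the two knobs mentioned above.)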
I'd wonder why anyone would feel the need to write their own server from scratch, especially when the scenario you describe is solved so well by others. If your wish is education, good luck. If you think you're going to improve on what's been done in the past, I'd re-examine that assumption.

How to most efficiently handle large numbers of file descriptors?

There appear to be several options available to programs that handle large numbers of socket connections (such as web services, p2p systems, etc.):
1. Spawn a separate thread to handle I/O for each socket.
2. Use the select system call to multiplex the I/O into a single thread.
3. Use the poll system call to multiplex the I/O (replacing select).
4. Use the epoll system calls to avoid having to repeatedly send socket fds across the user/kernel boundary.
5. Spawn a number of I/O threads that each multiplex a relatively small set of the total number of connections using the poll API.
6. As per #5, except using the epoll API to create a separate epoll object for each independent I/O thread.
On a multicore CPU I would expect that #5 or #6 would have the best performance, but I don't have any hard data backing this up. Searching the web turned up this page describing the experiences of the author testing approaches #2, #3 and #4 above. Unfortunately this web page appears to be around 7 years old with no obvious recent updates to be found.
So my question is which of these approaches have people found to be most efficient and/or is there another approach that works better than any of those listed above? References to real life graphs, whitepapers and/or web available writeups will be appreciated.
Speaking from my experience running large IRC servers: we used to use select() and poll() (because epoll()/kqueue() weren't available). At around 700 simultaneous clients, the server would be using 100% of a CPU (the IRC server wasn't multithreaded). However, interestingly, the server would still perform well. At around 4,000 clients, the server would start to lag.
The reason for this was that at around 700-ish clients, when we'd get back to select(), there would be only one client available for processing, so the for() loops scanning to find out which client it was would eat up most of the CPU. As we got more clients, more and more clients would need processing in each call to select(), so we'd become more efficient.
Moving to epoll()/kqueue(), similarly specced machines would trivially deal with 10,000 clients, and some (admittedly more powerful machines, but still machines that would be considered tiny by today's standards) have held 30,000 clients without breaking a sweat.
Experiments I've seen with SIGIO seem to suggest it works well for applications where latency is extremely important, where there are only a few active clients doing very little individual work.
I'd recommend using epoll()/kqueue() over select()/poll() in almost any situation. I haven't experimented with splitting clients between threads. To be honest, I've never found a service that needed enough optimisation work on the front-end client processing to justify experimenting with threads.
I have spent the last 2 years working on this specific issue (for the G-WAN web server, which comes with MANY benchmarks and charts exposing all this).
The model that works best under Linux is epoll with one event queue (and, for heavy processing, several worker threads).
If you have little processing (low processing latency), then using one thread will be faster than using several threads.
The reason for this is that epoll does not scale on multi-core CPUs (using several concurrent epoll queues for connection I/O in the same user-mode application will just slow down your server).
I have not looked seriously at epoll's code in the kernel (I have only focused on user mode so far), but my guess is that the epoll implementation in the kernel is crippled by locks.
This is why using several threads quickly hits the wall.
It goes without saying that such a poor state of things should not last if Linux wants to keep its position as one of the best-performing kernels.
From my experience, you'll have the best perf with #6.
I also recommend you look into libevent to deal with abstracting some of these details away. At the very least, you'll be able to see some of their benchmarks.
Also, about how many sockets are you talking about? Your approach probably doesn't matter too much until you start getting at least a few hundred sockets.
I use epoll() extensively, and it performs well. I routinely have thousands of sockets active, and test with up to 131,072 sockets. And epoll() can always handle it.
I use multiple threads, each of which poll on a subset of sockets. This complicates the code, but takes full advantage of multi-core CPUs.
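(A skeletal version of that layout, i.e. approach #6 from the question: one epoll instance per thread, each owning a disjoint subset of the sockets. Linux-only, error handling omitted:)

    // N threads, each with its own epoll instance watching its own subset
    // of sockets. New connections are assigned to a thread (round-robin,
    // say), so there is no shared epoll queue to contend on.
    #include <sys/epoll.h>
    #include <thread>
    #include <unistd.h>
    #include <vector>

    void ioThread(int epfd) {
        epoll_event events[256];
        for (;;) {
            int n = epoll_wait(epfd, events, 256, -1);
            for (int i = 0; i < n; ++i) {
                int fd = events[i].data.fd;
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof buf);  // ready: won't block
                if (r <= 0) {                           // EOF or error: drop it
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr);
                    close(fd);
                }
                // else: process r bytes from fd
            }
        }
    }

    int main() {
        unsigned nthreads = std::thread::hardware_concurrency();
        std::vector<int> epfds;
        for (unsigned i = 0; i < nthreads; ++i) {
            epfds.push_back(epoll_create1(0));
            std::thread(ioThread, epfds.back()).detach();
        }
        // As each client is accepted, hand it to one epoll instance:
        //   epoll_event ev{}; ev.events = EPOLLIN; ev.data.fd = clientFd;
        //   epoll_ctl(epfds[next++ % nthreads], EPOLL_CTL_ADD, clientFd, &ev);
        return 0;
    }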
