MSMQ for Managing Threads?

I am building an application that receives input from printers over the network (on specific ports) and from files created in folders, either locally or across the network. The user can create different threads to monitor different folders at the same time, as well as threads to handle the input from these printers over the network. The application is supposed to process the input data according to its type and output it. At the other end of the application, there would be 4 threads waiting for data from the input threads (of which there could be 10 or 20) to process it and apply 4 different tasks.
As we will have many threads running at the same time, I thought I would use MSMQ to manage them. Does MSMQ fit this scenario, or should I use another technique for managing these threads in terms of scheduling, prioritizing, and so on?
(P.S.: I was planning to build my own ThreadEngine class to take care of all of this until I heard about MSMQ, and I'm still not sure whether it's the right thing to use.)

MSMQ would be useful for managing your input/output data, not your threads. .NET already has the ThreadPool, the CCR and the TPL to assist you with concurrency and multithreading, so I would suggest reading up on those technologies and choosing the most appropriate one.
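To make that concrete, here is a minimal sketch (in C++, since the underlying pattern is language-neutral) of the hand-off those libraries give you: the 10-20 input threads push work items onto a shared in-process queue, and a fixed pool of 4 worker threads drains it. The class name, item counts and strings are all made up for illustration, not part of any framework.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// A simple blocking work queue: producers push, a fixed pool of workers pops.
struct WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> items;
    bool done = false;

    void push(std::string item) {
        { std::lock_guard<std::mutex> lk(m); items.push(std::move(item)); }
        cv.notify_one();
    }
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return done || !items.empty(); });
        if (items.empty()) return false;              // shut down and drained
        out = std::move(items.front()); items.pop();
        return true;
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};

int main() {
    WorkQueue q;

    // Four processing threads, mirroring the four output tasks in the question.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&q, i] {
            std::string item;
            while (q.pop(item))
                std::cout << "worker " << i << " processing " << item << "\n";
        });

    // Stand-ins for the folder/printer input threads.
    std::vector<std::thread> producers;
    for (int p = 0; p < 3; ++p)
        producers.emplace_back([&q, p] {
            for (int j = 0; j < 5; ++j)
                q.push("input-" + std::to_string(p) + "-" + std::to_string(j));
        });

    for (auto& t : producers) t.join();
    q.shutdown();
    for (auto& t : workers) t.join();
}
```

The queue, not MSMQ, is what coordinates the threads here; MSMQ would only come into play if the work items had to survive a crash or cross machine boundaries.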

MSMQ is a system message queue, not a thread pool manager.

This could be interesting in a case where you don't really mind poor performance and are really going for a system where tasks are persistent and transactional to guarantee execution.
If you are looking for performance then I agree with the other folks and highly discourage you from doing this, even with non-durable (RAM-based) queues.

Related

Performance implications of distribution of threads amongst processes in a server

The question title is pretty awkward, sorry about that.
I am currently working on the design of a server, and a comment came up from one of my co-workers that we should use multiple processes, since there was some performance hit to having too many threads in a single process (as opposed to having that same number of threads spread over multiple processes on the same machine).
The only thing I can think of that would cause this (other than bad OS scheduling) would be increased contention (for example, on the memory allocator), but I'm not sure how much that matters.
Is this a 'best practice'? Does anyone have some benchmarks they could share with me? Of course the answer may depend on the platform (I'm interested mostly in Windows/Linux/OS X, although I need to care about HP-UX, AIX, and Solaris to some extent).
There are of course other benefits to using a multi-process architecture, such as process isolation to limit the effect of a crash, but I'm interested in performance for this question.
For some context, the server is going to service long-running, stateful connections (so they cannot be migrated to other server processes) which send back a lot of data, and can also cause a lot of local DB processing on the server machine. It's going to use the proactor architecture in-process and be implemented in C++. The server will be expected to run for weeks/months without needing a restart (although this may be implemented by rotating new instances transparently behind some proxy).
Also, we will be using a multi-process architecture; my concern is more about scheduling connections to processes.

Seeking tutorials and information on load-balancing between threads

I know the term "Load Balancing" can be very broad, but the subject I'm trying to explain is more specific, and I don't know the proper terminology. What I'm building is a set of Server/Client applications. The server needs to be able to handle a massive amount of data transfer, as well as client connections, so I started looking into multi-threading.
There are essentially 3 ways I can see of implementing any sort of threading for the server...
One thread handling all requests (defeats the purpose of a thread if 500 clients are logged in)
One thread per user (risky, since that means creating 1 thread for each of the 500 clients)
Pool of threads which divide the work evenly for any number of clients (What I'm seeking)
The third one is what I'd like to know. This consists of a setup like this:
Maximum 250 threads running at once
500 clients will not create 500 threads, but share the 250
A Queue of requests will be pending to be passed into a thread
A thread is not tied down to a client, and vice-versa
Server decides which thread to send a request to based on activity (load balance)
I'm currently not seeking any code quite yet, but information on how a setup like this works, and preferably a tutorial to accomplish this in Delphi (XE2). Even a proper word or name to put on this subject would be sufficient so I can do the searching myself.
EDIT
I found it necessary to explain a little about what this will be used for. I will be streaming both commands and images, there will be a double-socket setup where there's one "Main Command Socket" and another "Add-on Image Streaming Socket". So really one connection is 2 socket connections.
Each connection to the server's main socket creates (or re-uses) an object representing all the data needed for that connection, including threads, images, settings, etc. For every connection to the main socket, a streaming socket is also connected. It's not always streaming images, but the command socket is always ready.
The point is that I already have a threading mechanism in my current setup (1 thread per session object) and I'd like to shift that over to a pool-like multithreading environment. The two connections together require higher-level control over these threads, and I can't rely on something like Indy to keep them synchronized; I'd rather know how things are working than learn to trust something else to do the work for me.
IOCP server. It's the only high-performance solution. It's essentially asynchronous in user mode ('overlapped I/O' in M$-speak): a pool of threads issues WSARecv, WSASend and AcceptEx calls and then they all wait on an IOCP queue for completion records. When something useful happens, a kernel threadpool performs the actual I/O and then queues up the completion records.
You need at least a buffer class and socket class, (and probably others for high-performance - objectPool and pooledObject classes so you can make socket and buffer pools).
500 threads may not be an issue on a server class computer. A blocking TCP thread doesn't do much while it's waiting for the server to respond.
There's nothing stopping you from creating some type of work queue on the server side, served by a limited size pool of threads. A simple thread-safe TList works great as a queue, and you can easily put a message handler on each server thread for notifications.
Still, at some point you may have too much work, or too many threads, for the server to handle. This is usually handled by adding another application server.
To ensure scalability, code for the idea of multiple servers, and you can keep scaling by adding hardware.
There may be some reason to limit the number of actual work threads, such as limiting lock contention on a database, or something similar, however, in general, you distribute work by adding threads, and let the hardware (CPU, redirector, switch, NAS, etc.) schedule the load.
Your implementation is completely tied to the communications components you use. If you use Indy, or anything based on Indy, it is one thread per connection - period! There is no way to change this. Indy will scale to hundreds of connections, but not thousands. Your best hope of using thread pools with your communications components is IOCP, but here your choices are limited by the lack of third-party components. I have done all this investigation before; you can see my question at stackoverflow.com/questions/7150093/scalable-delphi-tcp-server-implementation.
I have a fully working distributed development framework (threading and comms) that has been used in production for over 3 years now across more than a half-dozen separate systems and basically covers everything you have asked so far. The code can be found on the web as well.

What Use are Threads Outside of Parallel Problems on MultiCore Systems?

Threads make the design, implementation and debugging of a program significantly more difficult.
Yet many people seem to think that every task in a program that can be threaded should be threaded, even on a single core system.
I can understand threading something like an MPEG2 decoder that's going to run on a multicore CPU (which I've done), but what can justify the significant development costs threading entails when you're talking about a single core system, or even a multicore system if your task doesn't gain significant performance from a parallel implementation?
Or more succinctly, what kinds of non-performance related problems justify threading?
Edit
Well, I just ran across one instance that's not CPU limited but where threads make a big difference:
TCP, HTTP and the Multi-Threading Sweet Spot
Multiple threads are pretty useful when trying to max out your bandwidth to another peer over a high latency network connection. Non-blocking I/O would use significantly less local CPU resources, but would be much more difficult to design and implement.
Performing a CPU intensive task without blocking the user interface, for example.
Any application in which you may be waiting around for a resource (for example, blocking I/O from network sockets or disk devices) can benefit from threading.
In that case the thread blocking on the slow operation can be put to sleep while other threads continue to run (including, under some operating systems, the GUI thread which, if the OS cannot contact it for a while, will offer the user the chance to kill it, thinking it's deadlocked somehow).
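A tiny sketch of that idea in C++ (the slow_fetch() function is a made-up stand-in for any blocking call such as a socket read or disk access): hand the blocking operation to a worker thread and keep the foreground thread free to respond.

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical stand-in for a blocking operation (network read, disk access, ...).
std::string slow_fetch() {
    std::this_thread::sleep_for(std::chrono::seconds(2));
    return "payload";
}

int main() {
    // Run the blocking call on a worker thread; std::launch::async guarantees
    // it gets its own thread rather than being deferred.
    auto result = std::async(std::launch::async, slow_fetch);

    // The "UI" thread stays free to do other work (repaint, pump events, ...).
    while (result.wait_for(std::chrono::milliseconds(100)) != std::future_status::ready) {
        std::cout << "still responsive...\n";
    }
    std::cout << "got: " << result.get() << "\n";
}
```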
So it's not just for multi-core machines at all.
An interesting example is a webserver - you need to be able to handle multiple incoming connections that have nothing to do with each other.
what kinds of non-performance related problems justify threading?
Web applications are the classic example. Each user request is conceptually a new thread. Nothing to do with performance, it's just a natural fit for the design.
Blocking code is usually much simpler to write and easier to read (and therefore maintain) than non-blocking code. Yet, using blocking code limits you to a single execution path and also locks out things like user interface (mentioned) and other IO ports. Threading is an elegant solution in these cases.
Another case where multithreading is worth considering is when you have several near-synchronous I/O channels to manage: using multiple threads (and usually a local message queue) allows for much clearer code.
Here are a couple of specific and simple scenarios where I have launched threads...
A long running report request by the user. When the report is submitted, it is placed in a queue to be processed by a separate thread. The user can then go on within the application and check back later to see the status of their report; they aren't left with a "Processing..." page or icon.
A thread that iterates cache storage, removing data that has expired or is no longer needed. The thread's job within the application is independent of any processing for a specific user, but part of the overall application run-time maintenance.
Although not specifically a threading scenario, logging within our web site is handed off to a parallel process, so the throughput of the web site isn't hindered by the time it takes to record log data.
I agree that threading just for threading's sake isn't a good idea, and it can introduce problems within your application if it isn't done properly, but it is an extremely useful tool for solving some problems.
Whenever you need to call some external component (be it a database query, a 3rd-party library, an operating system primitive, etc.) that only provides a synchronous/blocking interface, or whose asynchronous interface is not worth the extra trouble and pain - and you also need some form of concurrency - e.g. serving multiple clients in a server or keeping the GUI responsive.
Well, how do you know if your app is going to run on a multi-core system or not?
Beyond that, there are a lot of processes that take up time but don't require the CPU, such as writing to a disk or networking. Who wants to push a button in a GUI and then have to sit there and wait for a network connection? Even on a single core machine, having a separate IO thread greatly improves the user experience. You always at least want a separate thread for the UI.
Yet many people seem to think that every task in a program that can be threaded should be threaded, even on a single core system.
"Many people"... Who?
Also, from my experience, many programs that should be multithreaded aren't (especially games; I have an i7, and yet most games still use only 1 of my cores), so I'm not sure what you're talking about. Programs like calc.exe are definitely not multithreaded (or, if they are, 1 thread does 99% of the work).
Performing a CPU intensive task without blocking the user interface, for example.
Yes, this is true, but it's fairly easy to implement and it's not what the OP is referring to (since, in this case, 1 thread does almost all the work and you only need very few mutexes).

Are message queues obsolete in linux?

I've been playing with message queues (System V, but POSIX should be ok too) in Linux recently and they seem perfect for my application, but after reading The Art of Unix Programming I'm not sure if they are really a good choice.
http://www.faqs.org/docs/artu/ch07s02.html#id2922148
The upper, message-passing layer of System V IPC has largely fallen out of use. The lower layer, which consists of shared memory and semaphores, still has significant applications under circumstances in which one needs to do mutual-exclusion locking and some global data sharing among processes running on the same machine. These System V shared memory facilities evolved into the POSIX shared-memory API, supported under Linux, the BSDs, MacOS X and Windows, but not classic MacOS.
http://www.faqs.org/docs/artu/ch07s03.html#id2923376
The System V IPC facilities are present in Linux and other modern Unixes. However, as they are a legacy feature, they are not exercised very often. The Linux version is still known to have bugs as of mid-2003. Nobody seems to care enough to fix them.
Are the System V message queues still buggy in more recent Linux versions? I'm not sure if the author means that POSIX message queues should be ok?
It seems that sockets are the preferred IPC for almost anything(?), but I cannot see how it would be very simple to implement message queues with sockets or something else. Or am I thinking too complexly?
I don't know if it's relevant that I'm working with embedded Linux?
Personally I am quite fond of message queues and think they are arguably the most under-utilized IPC in the unix world. They are fast and easy to use.
A couple of thoughts:
Some of this is just fashion. Old things become new again. Add a shiny doodad to message queues and they may be next year's newest and hottest thing. Look at Google's Chrome using separate processes instead of threads for its tabs. Suddenly people are thrilled that when one tab locks up it doesn't bring down the entire browser.
Shared memory has something of a He-man halo about it. You're not a "real" programmer if you aren't squeezing that last cycle out of the machine and MQs are marginally less efficient. For many, if not most apps, it is utter nonsense but sometimes it is hard to break a mindset once it takes hold.
MQs really aren't appropriate for applications with unbounded data. Stream oriented mechanisms like pipes or sockets are just easier to use for that.
The System V variants really have fallen out of favor. As a general rule go with POSIX versions of IPC when you can.
Yes, I think that message queues are appropriate for some applications. POSIX message queues provide a nicer interface, in particular, you get to give your queues names rather than IDs, which is very useful for fault diagnosis (makes it easier to see which is which).
Linux also allows you to mount the POSIX message queues as a filesystem, so you can see them with "ls" and delete them with "rm", which is quite handy (System V depends on the clunky "ipcs" and "ipcrm" commands).
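As a minimal sketch of what that looks like in practice (the queue name, sizes and message are arbitrary), a named POSIX queue is only a few calls, and the shell commands in the comment are the usual way to make the queues visible under /dev/mqueue:

```cpp
// Minimal sketch of a named POSIX message queue; queue name, sizes and message
// content are arbitrary. Build with -lrt on older glibc versions.
//
// To inspect queues from the shell (one-time setup, as root):
//   mkdir -p /dev/mqueue && mount -t mqueue none /dev/mqueue
//   ls /dev/mqueue
#include <fcntl.h>
#include <mqueue.h>
#include <cstdio>

int main() {
    mq_attr attr{};
    attr.mq_maxmsg = 10;       // queue depth
    attr.mq_msgsize = 128;     // max bytes per message

    // The name is system-global and must start with '/'.
    mqd_t q = mq_open("/example_queue", O_CREAT | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    const char msg[] = "hello";
    mq_send(q, msg, sizeof(msg), /*priority=*/0);

    char buf[128];
    unsigned prio = 0;
    ssize_t n = mq_receive(q, buf, sizeof(buf), &prio);
    if (n >= 0) printf("received %zd bytes (prio %u): %s\n", n, prio, buf);

    mq_close(q);
    mq_unlink("/example_queue");   // remove the name, like rm in /dev/mqueue
    return 0;
}
```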
I haven't actually used POSIX message queues because I always want to leave open the option to distribute my messages across a network. With that in mind, you might look at a more robust message-passing interface like zeromq or something that implements AMQP.
One of the nice things about 0mq is that when used from the same process space in a multithreaded app, it uses a lockless zero-copy mechanism that is quite fast. Still, you can use the same interface to pass messages over a network as well.
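For illustration, here is a rough sketch (using the plain libzmq C API from C++; the endpoint name and payload are made up, error handling omitted) of two threads in one process exchanging a message over the inproc transport:

```cpp
// Rough sketch: a producer thread hands a message to the main thread over
// ZeroMQ's inproc transport. Link with -lzmq.
#include <zmq.h>
#include <cstdio>
#include <cstring>
#include <thread>

int main() {
    void* ctx = zmq_ctx_new();

    // Bind before starting the producer: inproc endpoints must exist before connect().
    void* pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(pull, "inproc://work");

    std::thread producer([ctx]() {
        void* push = zmq_socket(ctx, ZMQ_PUSH);   // sockets must share the same context
        zmq_connect(push, "inproc://work");
        const char msg[] = "job #1";
        zmq_send(push, msg, strlen(msg), 0);
        zmq_close(push);
    });

    char buf[64];
    int n = zmq_recv(pull, buf, sizeof(buf) - 1, 0);  // blocks until the message arrives
    if (n >= 0) { buf[n] = '\0'; printf("got: %s\n", buf); }

    producer.join();
    zmq_close(pull);
    zmq_ctx_term(ctx);
    return 0;
}
```

Swapping the "inproc://..." endpoint for a "tcp://host:port" one is essentially all it takes to move the same code across a network.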
Biggest disadvantages of POSIX message queues:
POSIX message queues are not required to be compatible with select(). (They work with select() on Linux, but not on QNX.)
They have surprises.
A Unix datagram socket does the same job as a POSIX message queue, and it works at the socket layer, so it can be used with select()/poll() or other I/O-wait methods. That is an advantage when designing an event-based system, because it lets you avoid a busy loop.
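A small sketch of that point (using a socketpair() rather than a named endpoint, purely to keep it self-contained): a Unix-domain datagram socket drops straight into an ordinary poll() loop.

```cpp
// Sketch: a Unix-domain datagram socket works with plain poll()/select().
// A real server would bind a named AF_UNIX address; socketpair() is used here
// for brevity, and error handling is minimal.
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0) { perror("socketpair"); return 1; }

    const char msg[] = "ping";
    send(fds[0], msg, sizeof(msg), 0);          // one datagram, delivered as a unit

    pollfd pfd{fds[1], POLLIN, 0};
    if (poll(&pfd, 1, 1000) > 0 && (pfd.revents & POLLIN)) {   // event-driven wait, no busy loop
        char buf[64];
        ssize_t n = recv(fds[1], buf, sizeof(buf), 0);
        if (n > 0) printf("received %zd bytes: %s\n", n, buf);
    }

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```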
There are surprises in message queues. Think about mq_notify(). It is used to get a receive event, and the name sounds as if it notifies something about the message queue, but it actually registers for a notification rather than notifying anything.
A further surprise about mq_notify() is that it has to be re-registered after every mq_receive(), which may cause a race condition (when some other process/thread calls mq_send() between the mq_receive() and the mq_notify()).
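The usual way around that race, sketched below (the queue name and sizes are made up), is to re-register the notification first and then drain the non-blocking queue until mq_receive() returns EAGAIN, so a message sent in the gap is still picked up:

```cpp
// Sketch of the standard mq_notify() pattern: re-arm the notification first,
// then drain the (non-blocking) queue until EAGAIN. Queue name and sizes are
// arbitrary. Build with -lrt on older glibc versions.
#include <errno.h>
#include <fcntl.h>
#include <mqueue.h>
#include <signal.h>
#include <unistd.h>
#include <cstdio>

static mqd_t g_queue;
static void rearm_and_drain();

// Runs on a library-provided thread each time a message arrives on an empty queue.
static void on_message(union sigval) {
    rearm_and_drain();
}

static void rearm_and_drain() {
    // 1. Re-register BEFORE reading, so nothing sent in the meantime is missed.
    struct sigevent sev {};
    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = on_message;
    mq_notify(g_queue, &sev);

    // 2. Drain everything that is already queued.
    char buf[128];
    for (;;) {
        ssize_t n = mq_receive(g_queue, buf, sizeof(buf), nullptr);
        if (n < 0) {
            if (errno == EAGAIN) break;   // queue empty; wait for the next notification
            perror("mq_receive");
            break;
        }
        printf("got %zd bytes\n", n);
    }
}

int main() {
    mq_attr attr{};
    attr.mq_maxmsg = 10;
    attr.mq_msgsize = 128;
    g_queue = mq_open("/notify_demo", O_CREAT | O_RDONLY | O_NONBLOCK, 0600, &attr);
    if (g_queue == (mqd_t)-1) { perror("mq_open"); return 1; }

    rearm_and_drain();   // arm the first notification and catch anything already queued
    pause();             // in a real program, do other work here
    return 0;
}
```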
And it has a whole separate set of functions, mq_open(), mq_send(), mq_receive() and mq_close(), with their own definitions, which is redundant and in some cases inconsistent with the socket open(), send(), recv() and close() specifications.
I do not think message queue should be used for synchronization. eventfd and signalfd are suitable for that.
But POSIX message queues do have some realtime support: they have priority features.
Messages are placed on the queue in decreasing order of priority, with newer messages of the same priority being placed after older messages with the same priority.
But this kind of priority is also available for sockets, as out-of-band data!
Finally, to me, the POSIX message queue is a legacy API. I always prefer a Unix datagram socket over a POSIX message queue as long as the realtime features are not needed.
Message queues are very useful for building local, decoupled applications. They are super fast and block-oriented (no need for the buffering and framing that streaming sockets require); message delivery is basically a few memcpy() operations (user code copies the block to the kernel, and the kernel copies the block to the other process reading from the queue), and that's the whole story. Some well-known industry middleware, such as Oracle Tuxedo or Mavimax Enduro/X, uses these queues to help build load-balanced, high-performance, fault-tolerant, decomposed, distributed applications. These queues allow load balancing: several executables read from the same queue, and the kernel scheduler just hands each message to whichever process is idle. The nice thing on Linux is that poll() can be done on POSIX queues, which helps solve certain scenarios. On IBM AIX it is possible to poll() System V queues.
For example, two processes can easily communicate locally over the queues with quite impressive throughput (~70k requests + replies per second).
If networking is needed, then Enduro/X, for example, provides a tpbridge process which basically reads messages from the local queue and sends the blocks to some other machine, where the other end injects the messages back into its local queue.
Also, compared to sockets, you do not get issues such as busy/lingering sockets when, for example, some binary has crashed; a program can immediately start reading the queues and doing the processing at startup.

How to most efficiently handle large numbers of file descriptors?

There appear to be several options available to programs that handle large numbers of socket connections (such as web services, p2p systems, etc).
Spawn a separate thread to handle I/O for each socket.
Use the select system call to multiplex the I/O into a single thread.
Use the poll system call to multiplex the I/O (replacing the select).
Use the epoll system calls to avoid having to repeatedly pass socket fds across the user/kernel boundary.
Spawn a number of I/O threads that each multiplex a relatively small set of the total number of connections using the poll API.
As per #5 except using the epoll API to create a separate epoll object for each independent I/O thread.
On a multicore CPU I would expect that #5 or #6 would have the best performance, but I don't have any hard data backing this up. Searching the web turned up this page describing the experiences of the author testing approaches #2, #3 and #4 above. Unfortunately this web page appears to be around 7 years old with no obvious recent updates to be found.
So my question is which of these approaches have people found to be most efficient and/or is there another approach that works better than any of those listed above? References to real life graphs, whitepapers and/or web available writeups will be appreciated.
Speaking from my experience running large IRC servers, we used to use select() and poll() (because epoll()/kqueue() weren't available). At around 700 simultaneous clients, the server would be using 100% of a CPU (the IRC server wasn't multithreaded). However, interestingly, the server would still perform well. At around 4,000 clients, the server would start to lag.
The reason for this was that at around 700ish clients, when we'd get back to select() there would be one client available for processing. The for() loops scanning to find out which client it was would be eating up most of the CPU. As we got more clients, we'd start getting more and more clients needing processing in each call to select(), so we'd become more efficient.
Moving to epoll()/kqueue(), similarly spec'd machines would trivially deal with 10,000 clients, and some (admittedly more powerful machines, but still machines that would be considered tiny by today's standards) have held 30,000 clients without breaking a sweat.
Experiments I've seen with SIGIO seem to suggest it works well for applications where latency is extremely important, where there are only a few active clients doing very little individual work.
I'd recommend using epoll()/kqueue() over select()/poll() in almost any situation. I've not experimented with splitting clients between threads. To be honest, I've never found a service that needed enough optimisation work on the front-end client processing to justify experimenting with threads.
I have spent the last 2 years working on that specific issue (for the G-WAN web server, which comes with MANY benchmarks and charts exposing all this).
The model that works best under Linux is epoll with one event queue (and, for heavy processing, several worker threads).
If you have little processing (low processing latency), then using one thread will be faster than using several threads.
The reason for this is that epoll does not scale on multi-core CPUs (using several concurrent epoll queues for connection I/O in the same user-mode application will just slow down your server).
I did not look seriously at epoll's code in the kernel (I only focussed on user-mode so far) but my guess is that the epoll implementation in the kernel is crippled by locks.
This is why using several threads quickly hits a wall.
It goes without saying that such a poor state of things should not last if Linux wants to keep its position as one of the best performing kernels.
From my experience, you'll have the best perf with #6.
I also recommend you look into libevent to deal with abstracting some of these details away. At the very least, you'll be able to see some of their benchmarks.
Also, about how many sockets are you talking about? Your approach probably doesn't matter too much until you start getting at least a few hundred sockets.
I use epoll() extensively, and it performs well. I routinely have thousands of sockets active, and test with up to 131,072 sockets. And epoll() can always handle it.
I use multiple threads, each of which poll on a subset of sockets. This complicates the code, but takes full advantage of multi-core CPUs.
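As a rough sketch of that layout (the port, thread count and echo "processing" are arbitrary, and error handling is trimmed), each I/O thread owns its own epoll instance and the acceptor deals new connections out round-robin:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>
#include <thread>
#include <vector>

// Each I/O thread services only the sockets registered on its own epoll fd.
static void io_loop(int epfd) {
    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            char buf[4096];
            ssize_t r = read(fd, buf, sizeof(buf));
            if (r <= 0) {                     // peer closed or error: drop the connection
                epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr);
                close(fd);
            } else {
                write(fd, buf, r);            // trivial echo stands in for real processing
            }
        }
    }
}

int main() {
    const int kThreads = 4;                   // arbitrary pool size
    std::vector<int> epfds;
    std::vector<std::thread> threads;
    for (int i = 0; i < kThreads; ++i) {
        int epfd = epoll_create1(0);
        epfds.push_back(epfd);
        threads.emplace_back(io_loop, epfd);
    }

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);              // example port
    bind(listener, (sockaddr*)&addr, sizeof(addr));
    listen(listener, SOMAXCONN);

    // Hand each accepted socket to the next thread's epoll set, round-robin.
    // epoll_ctl() may safely be called while another thread is blocked in epoll_wait().
    for (int next = 0;; next = (next + 1) % kThreads) {
        int client = accept(listener, nullptr, nullptr);
        if (client < 0) continue;
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = client;
        epoll_ctl(epfds[next], EPOLL_CTL_ADD, client, &ev);
    }
}
```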
