Linux, communication between applications

In my embedded system running Linux (Ubuntu armhf) I have to communicate between processes.
I'm doing it with TCP sockets. It works great, but due to the high frequency of my requests I have very high processor usage (94% on average, measured with nmon).
Is there a way to lower it by doing this kind of communication more efficiently?

Shared memory and message queues can both be used to exchange information between processes. The difference is in how they are used, and each has advantages and disadvantages.
Shared memory
It's an area of storage that can be read and written by more than one process. It provides no inherent synchronization; in other words, it's up to the programmer to ensure that one process doesn't clobber another's data. But it's efficient in terms of throughput: reading and writing are relatively fast operations.
Message queue
A message queue is a one-way pipe: one process writes to the queue, and another reads the data in the order it was written until an end-of-data condition occurs. When the queue is created, the message size (bytes per message, usually fairly small) and queue length (maximum number of pending messages) are set. Access is slower than shared memory because each read or write operation typically transfers a single message. But the queue guarantees that each operation will either process an entire message successfully or fail without altering the queue. So the writer can never fail after writing only a partial message, and the reader will either retrieve a complete message or nothing at all.
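As a rough illustration of the shared-memory side, here is a minimal sketch using Python's standard multiprocessing.shared_memory module (Python 3.8+). For brevity both handles live in one process; in a real system the reader would attach from a different process using the segment name. The 4-byte layout and the value 1234 are assumptions made for the example.

```python
import struct
from multiprocessing import shared_memory

# Writer side: create a named shared-memory segment and store one 32-bit value.
writer = shared_memory.SharedMemory(create=True, size=4)
struct.pack_into("<I", writer.buf, 0, 1234)

# Reader side: attach to the same segment by name
# (normally this happens in another process that was told the name).
reader = shared_memory.SharedMemory(name=writer.name)
value = struct.unpack_from("<I", reader.buf, 0)[0]
print(value)

# Every handle is closed; the segment itself is removed exactly once.
reader.close()
writer.close()
writer.unlink()
```

Nothing here stops two processes from writing at the same time, which is the "no inherent synchronization" point above; a real design would pair the segment with a lock or semaphore.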

If you wish to stick with your basic architecture, you can switch from TCP sockets to Unix domain sockets (AF_UNIX/AF_LOCAL). Since it's a strictly local protocol, it doesn't have the overhead of TCP.
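A minimal sketch of that switch, in Python for brevity: the calls are the same as for TCP, only the address family (AF_UNIX) and the address (a filesystem path) change. The socket path and the echo-style reply are assumptions made for the example, and the server runs in a thread here only so the sketch is self-contained.

```python
import os
import socket
import tempfile
import threading

# Hypothetical socket path for this sketch; a real service would use a fixed, agreed path.
sock_path = os.path.join(tempfile.mkdtemp(), "demo.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(1)

def serve_one():
    # Accept a single connection and answer a single request.
    conn, _ = server.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"echo:" + request)

worker = threading.Thread(target=serve_one)
worker.start()

# Client side: identical to TCP client code apart from the family and address.
with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
    client.connect(sock_path)
    client.sendall(b"ping")
    reply = client.recv(1024)

worker.join()
server.close()
os.unlink(sock_path)
print(reply)
```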

Related

Is there any reason to lock a queue?

I'm just wondering if there would be any reason I might want to lock a queue. I am working on an application that has several threads that read from and write to a database. In order to reduce traffic, I want to reduce the number of calls to that database at any given point (I know many databases can handle some traffic already). Would it make any sense to make a queue for the read/write requests, where only the request at the top executes, and then protect the queue's push and pop commands with a lock? Is having a lock on each read/write call enough? Isn't a lock implemented as a "queue" by the OS anyway? Could the size of this "queue" be an issue, or would there be any other reason I wouldn't use a lock by itself?
Thanks!
You could limit the number of threads that are engaged in database requests or, if that's not feasible due to the nature of your app, you could use a more granular approach to limit access to the shared resource. In Python, you can use the built-in semaphore objects for inter-thread synchronization. For inter-process (or inter-thread) synchronization, you could use the third-party posix_ipc module. It depends on what your service's execution model is.
Most database clients wouldn't require any application-level throttling. In a typical system, the database connections would be pooled and the connection manager would be responsible for acquiring an available connection. Internally this usually involves a queue of some sort with timeouts to prevent waiting indefinitely. The database itself would then handle the scheduling of individual operations made by each connection.
However, a semaphore is a signalling primitive that can be used to limit the number of concurrent operations: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Semaphore.html
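A minimal sketch of that idea using Python's built-in threading.BoundedSemaphore rather than the Java class linked above. The limit of 3 and the fake_db_call function are assumptions made for the example; the extra counters exist only to show that the limit is respected.

```python
import threading
import time

MAX_CONCURRENT = 3                      # assumed limit for this sketch
db_slots = threading.BoundedSemaphore(MAX_CONCURRENT)

active = 0                              # threads currently inside the guarded section
peak = 0                                # highest value 'active' ever reached
counter_lock = threading.Lock()

def fake_db_call():
    """Stand-in for a real database request (hypothetical)."""
    global active, peak
    with db_slots:                      # blocks while MAX_CONCURRENT calls are in flight
        with counter_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)                # simulate query latency
        with counter_lock:
            active -= 1

threads = [threading.Thread(target=fake_db_call) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrency:", peak)
```

Ten threads are started, but at no point are more than MAX_CONCURRENT of them inside the guarded section.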
Tasks can also be modeled as a producer-consumer problem, which involves a shared queue. However, you'll have to deal with the added complexity of managing the consumer threads in addition to the producers.

Seeking tutorials and information on load-balancing between threads

I know the term "Load Balancing" can be very broad, but the subject I'm trying to explain is more specific, and I don't know the proper terminology. What I'm building is a set of Server/Client applications. The server needs to be able to handle a massive amount of data transfer, as well as client connections, so I started looking into multi-threading.
There are essentially three ways I can see to implement any sort of threading for the server:
1. One thread handling all requests (defeats the purpose of a thread if 500 clients are logged in)
2. One thread per user (it is risky to create one thread for each of the 500 clients)
3. A pool of threads which divides the work evenly for any number of clients (what I'm seeking)
The third one is what I'd like to know about. It consists of a setup like this:
Maximum 250 threads running at once
500 clients will not create 500 threads, but share the 250
A Queue of requests will be pending to be passed into a thread
A thread is not tied down to a client, and vice-versa
Server decides which thread to send a request to based on activity (load balance)
I'm currently not seeking any code quite yet, but information on how a setup like this works, and preferably a tutorial to accomplish this in Delphi (XE2). Even a proper word or name to put on this subject would be sufficient so I can do the searching myself.
EDIT
I found it necessary to explain a little about what this will be used for. I will be streaming both commands and images. There will be a double-socket setup with one "Main Command Socket" and another "Add-on Image Streaming Socket", so one connection is really two socket connections.
Each connection to the server's main socket creates (or re-uses) an object representing all the data needed for that connection, including threads, images, settings, etc. For every connection to the main socket, a streaming socket is also connected. It's not always streaming images, but the command socket is always ready.
The point is that I already have a threading mechanism in my current setup (1 thread per session object) and I'd like to shift that over to a pool-like multithreading environment. The two connections together require a higher-level control over these threads, and I can't rely on something like Indy to keep these synchronized. I'd rather know how things are working than learn to trust something else to do the work for me.
IOCP server. It's the only high-performance solution. It's essentially asynchronous in user mode ('overlapped I/O' in M$-speak): a pool of threads issues WSARecv, WSASend and AcceptEx calls, and then they all wait on an IOCP queue for completion records. When something useful happens, a kernel thread pool performs the actual I/O and then queues up the completion records.
You need at least a buffer class and socket class, (and probably others for high-performance - objectPool and pooledObject classes so you can make socket and buffer pools).
500 threads may not be an issue on a server class computer. A blocking TCP thread doesn't do much while it's waiting for the server to respond.
There's nothing stopping you from creating some type of work queue on the server side, served by a limited size pool of threads. A simple thread-safe TList works great as a queue, and you can easily put a message handler on each server thread for notifications.
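The work-queue idea can be sketched as follows. This is in Python rather than Delphi, purely to show the shape: a thread-safe queue of pending requests, a fixed pool of workers, and no binding between a client and a thread. The pool size, the request tuples and the upper-casing "handler" are all assumptions made for the example.

```python
import queue
import threading

NUM_WORKERS = 4                      # assumed pool size for this sketch
pending = queue.Queue()              # thread-safe FIFO of pending requests
results = {}
results_lock = threading.Lock()

def worker():
    while True:
        item = pending.get()
        if item is None:             # sentinel: no more work for this worker
            break
        client_id, payload = item
        handled = payload.upper()    # placeholder for real request handling
        with results_lock:
            results[client_id] = handled

pool = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in pool:
    t.start()

# 20 "clients" share the 4 workers; whichever worker is free takes the next request.
for client_id in range(20):
    pending.put((client_id, f"request-{client_id}"))
for _ in pool:
    pending.put(None)                # one sentinel per worker

for t in pool:
    t.join()
print(len(results), results[7])      # prints: 20 REQUEST-7
```

Because idle workers pull from the shared queue, the load spreads itself across the pool without the server having to pick a thread for each request.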
Still, at some point you may have too much work, or too many threads, for the server to handle. This is usually handled by adding another application server.
To ensure scalability, code for the idea of multiple servers, and you can keep scaling by adding hardware.
There may be some reason to limit the number of actual work threads, such as limiting lock contention on a database or something similar. In general, however, you distribute work by adding threads and let the hardware (CPU, redirector, switch, NAS, etc.) schedule the load.
Your implementation is completely tied to the communications components you use. If you use Indy, or anything based on Indy, it is one thread per connection - period! There is no way to change this. Indy will scale to hundreds of connections, but not thousands. Your best hope for using thread pools with your communications components is IOCP, but here your choices are limited by the lack of third-party components. I have done all the investigation before and you can see my question at stackoverflow.com/questions/7150093/scalable-delphi-tcp-server-implementation.
I have a fully working distributed development framework (threading and comms) that has been used in production for over 3 years now across more than a half-dozen separate systems and basically covers everything you have asked so far. The code can be found on the web as well.

Does a thread have a limit on the network bandwidth it can use?

I heard there is some limitation on how much network bandwidth a single thread can use. If this is true, is it the reason to use multithreaded programming to achieve maximum bandwidth?
The reason to use multithreading for network tasks is that one thread might be waiting for a response from the remote server. Creating multiple threads helps ensure that at least one thread is downloading at any given time while the others wait on their own requests.
The usual reason for issuing more than one network request at a time (either explicitly with user threads, or implicitly with kernel threads and asynchronous callbacks) is that the effects of network latency can be minimised. Latency can have a large effect. A web connection, for example, needs a DNS lookup first, then a TCP 3-way handshake, then some data transfer and finally a 4-way close. If the page size is small and the bandwidth is large compared with the latency, most of the time is spent waiting for protocol exchanges.
So, if you are crawling multiple servers, a multithreaded design is hugely faster even on a single-core machine. If you are downloading a single video file from one server, not so much.
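The latency argument can be seen in a small simulation. This is only a sketch: fake_fetch does not touch the network, it just sleeps for an assumed 50 ms round trip, which is enough to show that threads overlap the waiting even though no extra CPU work is being done.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.05   # assumed per-request round-trip time, in seconds

def fake_fetch(n):
    """Stand-in for a network request: the thread only waits, as it would on a remote server."""
    time.sleep(LATENCY)
    return n * n

# One request after another: the waits add up.
start = time.perf_counter()
sequential = [fake_fetch(n) for n in range(8)]
sequential_time = time.perf_counter() - start

# Eight requests in flight at once: the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    concurrent = list(pool.map(fake_fetch, range(8)))
concurrent_time = time.perf_counter() - start

print(f"sequential {sequential_time:.2f}s, concurrent {concurrent_time:.2f}s")
```

The single-large-download case corresponds to one long fake_fetch call: there is nothing to overlap, so extra threads do not help.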

Are message queues obsolete in Linux?

I've been playing with message queues (System V, but POSIX should be ok too) in Linux recently and they seem perfect for my application, but after reading The Art of Unix Programming I'm not sure if they are really a good choice.
http://www.faqs.org/docs/artu/ch07s02.html#id2922148
The upper, message-passing layer of System V IPC has largely fallen out of use. The lower layer, which consists of shared memory and semaphores, still has significant applications under circumstances in which one needs to do mutual-exclusion locking and some global data sharing among processes running on the same machine. These System V shared memory facilities evolved into the POSIX shared-memory API, supported under Linux, the BSDs, MacOS X and Windows, but not classic MacOS.
http://www.faqs.org/docs/artu/ch07s03.html#id2923376
The System V IPC facilities are present in Linux and other modern Unixes. However, as they are a legacy feature, they are not exercised very often. The Linux version is still known to have bugs as of mid-2003. Nobody seems to care enough to fix them.
Are the System V message queues still buggy in more recent Linux versions? I'm not sure if the author means that POSIX message queues should be ok?
It seems that sockets are the preferred IPC for almost anything(?), but I cannot see how it would be very simple to implement message queues with sockets or something else. Or am I thinking too complexly?
I don't know if it's relevant that I'm working with embedded Linux?
Personally I am quite fond of message queues and think they are arguably the most under-utilized IPC in the unix world. They are fast and easy to use.
A couple of thoughts:
Some of this is just fashion. Old things become new again. Add a shiny do-dad on message queues and they may be next year's newest and hottest thing. Look at Google's Chrome using separate processes instead of threads for its tabs. Suddenly people are thrilled that when one tab locks up it doesn't bring down the entire browser.
Shared memory has something of a He-man halo about it. You're not a "real" programmer if you aren't squeezing that last cycle out of the machine and MQs are marginally less efficient. For many, if not most apps, it is utter nonsense but sometimes it is hard to break a mindset once it takes hold.
MQs really aren't appropriate for applications with unbounded data. Stream oriented mechanisms like pipes or sockets are just easier to use for that.
The System V variants really have fallen out of favor. As a general rule go with POSIX versions of IPC when you can.
Yes, I think that message queues are appropriate for some applications. POSIX message queues provide a nicer interface; in particular, you get to give your queues names rather than IDs, which is very useful for fault diagnosis (it makes it easier to see which is which).
Linux allows you to mount the POSIX message queues as a filesystem (type mqueue), see them with "ls" and delete them with "rm", which is quite handy too (System V depends on the clunky "ipcs" and "ipcrm" commands).
I haven't actually used POSIX message queues because I always want to leave open the option to distribute my messages across a network. With that in mind, you might look at a more robust message-passing interface like zeromq or something that implements AMQP.
One of the nice things about 0mq is that when used from the same process space in a multithreaded app, it uses a lockless zero-copy mechanism that is quite fast. Still, you can use the same interface to pass messages over a network as well.
Biggest disadvantages of POSIX message queues:
- POSIX does not require a message queue to be compatible with select(). (It works with select() on Linux, but not on QNX.)
- It has surprises.
A Unix datagram socket does the same job as a POSIX message queue, and it works in the socket layer, so it can be used with select()/poll() or other I/O-wait methods. Using select()/poll() is an advantage when designing an event-based system, because it makes it possible to avoid a busy loop.
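A minimal sketch of that combination in Python: a connected pair of Unix datagram sockets, with select() used to wait for readability. The message contents and the 100 ms timeout are assumptions made for the example; both ends live in one process only to keep the sketch self-contained.

```python
import select
import socket

# A connected pair of Unix datagram sockets; each send() is one discrete message.
tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

tx.send(b"first")
tx.send(b"second message")

received = []
while True:
    # Sleep in select() until rx is readable (or 100 ms pass) instead of busy-looping.
    readable, _, _ = select.select([rx], [], [], 0.1)
    if not readable:
        break
    received.append(rx.recv(4096))   # returns exactly one datagram per call

tx.close()
rx.close()
print(received)                      # prints: [b'first', b'second message']
```

Message boundaries are preserved, as with a message queue, and the same descriptor can sit in a select()/poll() set alongside any other sockets or pipes.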
The message queue API has surprises. Consider mq_notify(), which is used to get a receive event. The name suggests that it notifies something about the message queue, but it actually registers for notification instead of notifying anything.
A further surprise is that the registration is one-shot, so mq_notify() has to be called again after every mq_receive(). This can cause a race condition when some other process or thread calls mq_send() between the calls to mq_receive() and mq_notify().
It also has a whole set of calls, mq_open(), mq_send(), mq_receive() and mq_close(), with their own definitions, which is redundant and in some cases inconsistent with the specification of the socket open(), send(), recv() and close() methods.
I do not think a message queue should be used for synchronization; eventfd and signalfd are suitable for that.
But the POSIX message queue has some real-time support: it has priority features.
Messages are placed on the queue in decreasing order of priority, with newer messages of the same priority being placed after older messages with the same priority.
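The ordering rule quoted above can be modelled in a few lines. Note that this is a pure-Python toy model of the rule only (highest priority first, FIFO within a priority); the class name and its methods are hypothetical and it does not call mq_send() or mq_receive().

```python
import heapq
import itertools

class PriorityMessageModel:
    """Toy model of the ordering rule; not a wrapper around a real POSIX queue."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # arrival counter keeps FIFO order within a priority

    def send(self, payload, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), payload))

    def receive(self):
        neg_priority, _, payload = heapq.heappop(self._heap)
        return payload, -neg_priority

q = PriorityMessageModel()
q.send("log line", priority=0)
q.send("alarm A", priority=9)
q.send("status", priority=3)
q.send("alarm B", priority=9)
order = [q.receive() for _ in range(4)]
print(order)
# prints: [('alarm A', 9), ('alarm B', 9), ('status', 3), ('log line', 0)]
```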
But this priority is also available for sockets, as out-of-band data!
Finally, to me, the POSIX message queue is a legacy API. I always prefer a Unix datagram socket over a POSIX message queue as long as the real-time features are not needed.
Message queues are very useful for building local decoupled applications. They are very fast and block-organized (no need for buffering, cutting, etc., as is the case with stream sockets); delivery is basically a couple of memcpy() operations (user code copies a block to the kernel, and the kernel copies the block to the other process reading from the queue), and that's the whole story for message delivery. Some well-known industry middleware, such as Oracle Tuxedo or Mavimax Enduro/X, uses these queues to help build load-balanced, high-performance, fault-tolerant, decomposed, distributed applications. These queues allow load balancing: when several executables read from the same queue, the kernel scheduler simply hands each message to whichever process is idle. The nice thing on Linux is that poll() can be used on POSIX queues, which helps to solve certain scenarios. On IBM AIX it is possible to poll System V queues.
For example, two processes can communicate easily locally over the queues with quite impressive throughput (~70k req+rply/sec).
If networking is needed, then Enduro/X, for example, provides the tpbridge process, which basically reads messages from the local queue and sends the blocks to another machine, where the other end injects the messages back into the local queue.
Also, compared to sockets, queues do not suffer from issues such as busy or lingering sockets after some binary has crashed; i.e. a program can immediately start reading the queues and processing at startup.

How to most efficiently handle large numbers of file descriptors?

There appear to be several options available to programs that handle large numbers of socket connections (such as web services, p2p systems, etc).
1. Spawn a separate thread to handle I/O for each socket.
2. Use the select system call to multiplex the I/O into a single thread.
3. Use the poll system call to multiplex the I/O (replacing the select).
4. Use the epoll system calls to avoid having to repeatedly send sockets fd's through the user/system boundaries.
5. Spawn a number of I/O threads that each multiplex a relatively small set of the total number of connections using the poll API.
6. As per #5 except using the epoll API to create a separate epoll object for each independent I/O thread.
On a multicore CPU I would expect that #5 or #6 would have the best performance, but I don't have any hard data backing this up. Searching the web turned up this page describing the experiences of the author testing approaches #2, #3 and #4 above. Unfortunately this web page appears to be around 7 years old with no obvious recent updates to be found.
So my question is which of these approaches have people found to be most efficient and/or is there another approach that works better than any of those listed above? References to real life graphs, whitepapers and/or web available writeups will be appreciated.
Speaking from my experience of running large IRC servers, we used to use select() and poll() (because epoll()/kqueue() weren't available). At around 700 simultaneous clients, the server would be using 100% of a CPU (the IRC server wasn't multithreaded). However, interestingly, the server would still perform well. At around 4,000 clients, the server would start to lag.
The reason for this was that at around 700ish clients, when we'd get back to select() there would be one client available for processing. The for() loops scanning to find out which client it was would be eating up most of the CPU. As we got more clients, we'd start getting more and more clients needing processing in each call to select(), so we'd become more efficient.
Moving to epoll()/kqueue(), similarly spec'd machines would trivially deal with 10,000 clients, and some (admittedly more powerful machines, but still machines that would be considered tiny by today's standards) have held 30,000 clients without breaking a sweat.
Experiments I've seen with SIGIO seem to suggest it works well for applications where latency is extremely important, where there are only a few active clients doing very little individual work.
I'd recommend using epoll()/kqueue() over select()/poll() in almost any situation. I've not experimented with splitting clients between threads. To be honest, I've never found a service that needed enough optimisation work on the front-end client processing to justify experimenting with threads.
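To make the difference concrete, here is a small Linux-only sketch using Python's select.epoll. The 100 socket pairs stand in for connected clients, and the choice of clients 7 and 42 as the only active ones is an assumption made for the example. The point is that epoll returns just the ready descriptors, so there is no scan over every client as in the select() loop described above.

```python
import select
import socket

NUM_CLIENTS = 100
pairs = [socket.socketpair() for _ in range(NUM_CLIENTS)]   # (client end, server end)

ep = select.epoll()
by_fd = {}                                # file descriptor -> client index
for index, (_, server_end) in enumerate(pairs):
    ep.register(server_end.fileno(), select.EPOLLIN)
    by_fd[server_end.fileno()] = index

# Only two of the 100 "clients" send anything.
pairs[7][0].send(b"hello")
pairs[42][0].send(b"world")

# epoll hands back only the ready descriptors; no loop over all 100 is needed.
events = ep.poll(timeout=1.0)
ready = sorted(by_fd[fd] for fd, mask in events if mask & select.EPOLLIN)
print(ready)                              # prints: [7, 42]

ep.close()
for client_end, server_end in pairs:
    client_end.close()
    server_end.close()
```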
I have spent the 2 last years working on that specific issue (for the G-WAN web server, which comes with MANY benchmarks and charts exposing all this).
The model that works best under Linux is epoll with one event queue (and, for heavy processing, several worker threads).
If you have little processing (low processing latency), then using one thread will be faster than using several threads.
The reason for this is that epoll does not scale on multi-core CPUs (using several concurrent epoll queues for connection I/O in the same user-mode application will just slow down your server).
I did not look seriously at epoll's code in the kernel (I have only focused on user mode so far), but my guess is that the epoll implementation in the kernel is crippled by locks.
This is why using several threads quickly hits the wall.
It goes without saying that such a poor state of things should not last if Linux wants to keep its position as one of the best performing kernels.
From my experience, you'll have the best perf with #6.
I also recommend you look into libevent to deal with abstracting some of these details away. At the very least, you'll be able to see some of their benchmarks.
Also, about how many sockets are you talking about? Your approach probably doesn't matter too much until you start getting at least a few hundred sockets.
I use epoll() extensively, and it performs well. I routinely have thousands of sockets active, and test with up to 131,072 sockets. And epoll() can always handle it.
I use multiple threads, each of which polls a subset of the sockets. This complicates the code, but takes full advantage of multi-core CPUs.
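A rough sketch of that layout (approach #6 from the question), using Python's selectors module, which is backed by epoll on Linux. The thread count, socket count and round-robin split are assumptions made for the example, and each loop exits after serving every socket once so that the sketch terminates.

```python
import selectors
import socket
import threading

NUM_THREADS = 2
NUM_SOCKETS = 8
pairs = [socket.socketpair() for _ in range(NUM_SOCKETS)]   # (client end, server end)
handled = {}                      # socket index -> name of the thread that served it
handled_lock = threading.Lock()

def io_loop(my_indices):
    # Each I/O thread owns a private selector covering only its subset of sockets.
    sel = selectors.DefaultSelector()
    for i in my_indices:
        sel.register(pairs[i][1], selectors.EVENT_READ, data=i)
    remaining = len(my_indices)
    while remaining:
        for key, _ in sel.select(timeout=1.0):
            key.fileobj.recv(64)
            with handled_lock:
                handled[key.data] = threading.current_thread().name
            sel.unregister(key.fileobj)
            remaining -= 1
    sel.close()

# Every "client" sends one byte so that each loop has work and can finish.
for client_end, _ in pairs:
    client_end.send(b"x")

# Round-robin split: thread 0 gets sockets 0, 2, 4, 6 and thread 1 gets 1, 3, 5, 7.
threads = [
    threading.Thread(target=io_loop, args=(range(t, NUM_SOCKETS, NUM_THREADS),), name=f"io-{t}")
    for t in range(NUM_THREADS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for client_end, server_end in pairs:
    client_end.close()
    server_end.close()
print(sorted(handled.items()))
```

Because no selector is shared between threads, the only cross-thread synchronization in this sketch is the lock around the bookkeeping dictionary.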

Resources