High Performance Socket Server using Perl - linux

i need to write a socket server using perl which will run on a 64bit linux (2.6x kernel). Is there a library to support IO Completion Ports and some equivalent on Linux?
I need to listen to multiple ports. 8000-8100 is there a smart way doing this?
The protocol has to use a length byte.
What threading library do you recommend? I have written something similar on Windows using a cooperative multitasking based threadscheduler. i mean i want to avoid creating for each socket a thread to handle more than 10.000 simultaneous conenctions.
thanks in advance.

Threading in Perl is generally not adviced.
Instead, for high performance, you should consider looking into non blocking or event driven programming.
With regular sockets, your process blocks every IO operation, i.e reading from a socket that isn't ready will put your process to sleep until data is available. with non blocking/event driven you poll the sockets and get callbacks when the sockets are ready to be read from or written to, so a single process can multiplex on many sockets, thus providing good scalable performance since you don't need to fork new processes to handle more clients.
There are many good event based frameworks in Perl, e.g POE and AnyEvent POE is a specific event loop with lots of modules and features and AnyEvent is an abstraction layer that lets you use multiple event loops in the same code.
You should also look into libev which is similar to POE but with a lot less overhead.
Writing event driven code is somewhat tricky at first, since you need to be careful with the blocking code you do have, e.g cpu intensive operation, or using libraries which aren't non blocking. because since you have only one process, if it's busy doing something, it can't do anything else - like poll on the sockets and issue callbacks.
So, if you need both non blocking and intensive computations, one way to do it is to create worker forks and use non blocking pipes to communicate between them and your event loop, which is really straight forward with the above libraries.

Related

libuv vs. raw epoll or IOCP

I'm writing the IO core for a messaging library and considering libuv vs. using raw epoll on linux and IOCP on windows (and eventually others, solaris events etc.) I like the portability of libuv, I'm looking at the performance.
epoll and IOCP allow multiple threads to wait directly for IO events, the kernel does the dispatching. Potentially more efficient than user-space dispatching though I don't have any numbers.
libuv (based on my reading) has a thread-usafe event loop, but I could implement a leader-follower thread pool. By that I mean one thread (at a time) is the "leader" waiting for events. When the leader gets an event it signals that a follower should take over as leader. The ex-leader process the event and then becomes a follower.
My hope is that should be close in performance to raw multi-threaded epoll/IOCP, assuming libuv is efficiently implemented. I will do my own measurements, but I'd like to hear from anyone with experience.
Disclaimer: I'm one of the maintainers of libuv.
I'd advise you start with libuv. If only because it contains hunderds of corner cases already taken care of, and tons of knowledge built in. If you want to support other platforms you are going to end up reinventing libuv one way or another :-)
Once you have built your server prototype with libuv run some benchmarks and see where the bottleneck is. Depending on the type of server you are writing you may need / want a multi-threaded event loop, which libuv isn't, but chances are libuv will work fine enough for you.

What really is asynchronous computing?

I've been reading (and working) quite a bit with massively multi-threaded applications, and with IO, and I've found that the term asynchronous has become some sort of catch-all for multiple vague ideas. I'm wondering if I understand it correctly. The way I see it is that there are two main branches of "asynchronicity".
Asynchronous I/O. Such as network read/write. What this really boils down to is efficient parallel processing between multiple CPUs, such as your main CPU and your NIC CPU. The idea is to have multiple processors running in parallel, exchanging data, without blocking waiting for the other to finish and return the results of it's job.
Minimizing context-switching penalties by minimizing use of threads. This seems to be what the .NET framework is focusing on with it's async/await features. Instead of spawning/closing/blocking threads, break parallel jobs into tasks, and use a software task scheduler to keep a pool of threads as busy as possible without resorting to spawning new threads.
These seem like two entirely separate concepts with no similarities that could tie them together, but are both referred to by the same "asynchronous computing" vocabulary.
Am I understanding all of this correctly?
Asynchronous basically means not blocking, i.e. not having to wait for an operation to complete.
Threads are just one way of accomplishing that. There are many ways of doing this, from hardware level, SO level, software level.
Someone with more experience than me can give examples of asyncronicity not related to threads.
What this really boils down to is efficient parallel processing between multiple CPUs, such as your main CPU and your NIC CPU. The idea is to have multiple processors running in parallel...
Asynchronous programming is not all about multi-core CPU's and parallelism: consider a single core CPU, with just one thread creating email messages and sends them. In a synchronous fashion, it would spend a few micro seconds to create the message, and a lot more time to send it through network, and only then create the next message. But in asynchronous program, the thread could create a new message while the previous one is being sent through the network. One implementation for that kind of program can be using .NET async/await feature, where you can have just one thread. But even a blocking IO program could be considered asynchronous: If the main thread creates the messages and queues them in a buffer, which another thread pulls them from and sends them in a blocking IO way. From the main thread's point of view - it's completely async.
.NET async/await just uses the OS api's which are already async - reading /writing a file, send /receive data through network, they are all async anyway - the OS doesn't block on them (the drivers themselves are async).
Asynchronous is a general term, which does not have widely accepted meaning. Different domains have different meanings to it.
For instance, async IO means that instead of blocking on IO call, something else happens. Something else can be really different things, but it usually involves some sort of notification of call completion. Details might differ. For instance, a notification might be built into the call itself - like in MS Completeion Ports (if memory serves). Or, it can be something verify do before you make a call so that the call can not block - this is what poll() and friends do.
Async might also well mean simply parallel execution. For instance, one might say that 'database is updated asynchronously' meaning that there is a dedicated thread which handles database connectivity, and that thread does not slow down the main processing thread.

Linux IPC with single writer, multible readers

In my application, I have one process (the "writer") reading data from external hardware. This process should supply consistent packets to several "readers".
Question:
How can one process send consistent data-packets (not blocking) to several clients while they see something like the end of a personal fifo? I'm on Debian Linux.
1) In my first approach I tried "datagram - unix domain sockets", which worked well.
But with the "writer" as server, all the clients have to poll the server permanently. :-(
They get one packet several times; or miss a packet if not polling fast enough.
2) My second approach was FIFOs (Named Pipes) which works too, but with several readers "strange things happen", what I got confirmed here: http://beej.us/guide/bgipc/output/html/multipage/fifos.html
I tried this, searched the net and stackoverflow the whole day, but I could not find a reasonable answer.
Edit: Sorry, I did not mention: I can't use socketpair() and fork. My programs are independently developed. I'd like to have the writer ready, while developing new readers.
If the writer process is a server, it might fork the client processes and just plain pipe(2) for communication. If there is no parent/child relationship, consider named pipes made with mkfifo(3), or AF_UNIX sockets (see unix(7) and scoket(2) ....) which are bidirectional (AF_UNIX sockets are much faster than TCP/IP or UDP/IP on the same machine).
Notice that your writer process is reading data from your hardware device and is writing or sending data to several reader clients. So your writer process deals with many file descriptors simultaneously (the hardware device to be read, and the sockets or pipes to be written to the clients, at least one file descriptor per client).
However, the important thing is to have some event loop (notably on the server side, and probably also inside clients). This means that you call some multiplexing syscall like poll(2) in the loop, and you "decide" if you are reading or writing or connecting (and which file descriptor should be read, or should be written, or should connect) in each iteration. See also read(2), write(2), connect(2), send(2), recv(2) etc... Notice that you should buffer data with an event loop (since read and write can be on "partial" or "incomplete" messages).
Notice that poll is not eating CPU resources when waiting for I/O. You could, but you should not anymore, use some older multiplexing syscall (like the obsolete select(2) ...). Use poll(2).
You may want to use a library to provide the event loop, e.g. libevent or libev... See also this answer. That event loop should also (on the server side) poll then read the hardware device.
If some of the programs are using a GUI toolkit (e.g. on the client side) like Qt or Gtk they should profit of the existing event loop provided by that toolkit...
You should read Advanced Linux Programming and know about the C10K problem.
If signals or timers are important (read carefully signal(7) and time(7)) the Linux specific signalfd(2) and timerfd_create(2) could be very helpful since they play nicely with event loops. These linux specific syscalls (signalfd & timerfd_create ...) are too recent to be mentioned in Advanced Linux Programming.
BTW, you could study the source code of existing free software similar to yours, and/or use strace(1) to understand the exact syscalls that they are doing.
If you have no loop around a multiplexing syscall (à la poll(2)) then you have no event loop and your design is buggy and cannot possibly and reliably work (since you need to react to several file descriptors at once).
You could also use a multi-threaded approach, but it is much more complex and not worth the effort in your particular case.
ZeroMQ has pattern for this problem. Is fast and supports a lot of programming languages. Pub-Sub. See: https://zeromq.org/ (free and open source).

Would handling each TCP connection in a separate thread improve latency?

I have an FTP server, implemented on top of QTcpServer and QTcpSocket.
I take advantage of the signals and slots mechanism to support multiple TCP connections simultaneously, even though I have a single thread. My code returns as soon as possible to the event loop, it doesn't block (no wait functions), and it doesn't use nested event loops anywhere. That way I already have cooperative multitasking, like Win3.1 applications had.
But a lot of other FTP servers are multithreaded. Now I'm wondering if using a separate thread for handling each TCP connection would improve performance, and especially latency.
On one hand, threads add to latency because you need to start a new thread for each new connection, but on the other, with my cooperative multitasking, other TCP connections have to wait until I've returned to the main loop before their readyRead()/bytesWritten() signals can be handled.
In your current system and ignoring file I/O time one processor is always doing something useful if there's something useful to be done, and waiting ready-to-go if there's nothing useful to be done. If this were a single processor (single core) system you would have maximized throughput. This is often a very good design -- particularly for an FTP server where you don't usually have a human waiting on a packet-by-packet basis.
You have also minimized average latency (for a single processor system.) What you do not have is consistent latency. Measuring your system's performance is likely to show a lot of jitter -- a lot of variation in the time it takes to handle a packet. Again because this is FTP and not real-time process control or human interaction, jitter may not be a problem.
Now, however consider that there is probably more than one processor available on your system and that it may be possible to overlap I/O time and processing time.
To take full advantage of a multi-processor(core) system you need some concurrency.
This normally translates to using multiple threads, but it may be possible to achieve concurrency via asynchronous (non-blocking) file reads and writes.
However, adding multiple threads to a program opens up a huge can-of-worms.
If you do decide to go the MT route, I'd suggest that you consider depending on a thread-aware I/O library. QT may provide that for you (I'm not sure.) If not, take a look at boost::asio (or ACE for an older, but still solid solution). You'll discover that using the MT capabilities of such a library involves a considerable investment in learning time; however as it turns out the time to add on multithreading "by-hand" and get it right is even worse.
So I'd say stay with your existing solution unless you are worried about unused Processor cycles and/or jitter in which case start learning QT's multithreading support or boost::asio.
Do you need to start a new thread for each new connection? Could you not just have a pool of threads that acts on requests as and when they arrive. This should reduce some of the latency. I have to say that in general a multi-threaded FTP server should be more responsive that a single-threaded one. Is it possible to have an event based FTP server?

Instant Messaging Server Design

Let's suppose we have an instant messaging application, client-server based, not p2p. The actual protocol doesn't matter, what matters is the server architecture. The said server can be coded to operate in single-threaded, non-parallel mode using non-blocking sockets, which by definition allow us to perform operations like read-write effectively immediately (or instantly). This very feature of non-blocking sockets allows us to use some sort of select/poll function at the very core of the server and waste next to no time in the actual socket read/write operations, but rather to spend time processing all this information. Properly coded, this can be very fast, as far as I understand. But there is the second approach, and that is to multithread aggressively, creating a new thread (obviously using some sort of thread pool, because that very operation can be (very) slow on some platforms and under some circumstances), and letting those threads to work in parallel, while the main background thread handles accept() and stuff. I've seen this approach explained in various places over the Net, so it obviously does exist.
Now the question is, if we have non-blocking sockets, and immediate read/write operations, and a simple, easily coded design, why does the second variant even exist? What problems are we trying to overcome with the second design, i.e. threads? AFAIK those are usually used to work around some slow and possibly blocking operations, but no such operations seem to be present there!
I'm assuming you're not talking about having a thread per client as such a design is usually for completely diffreent reasons, but rather a pool of threads each handles several concurrent clients.
The reason for that arcitecture vs a single threaded server is simply to take advantage of multiple processors. You're doing more work than simply I/O. You have to parse the messages, do various work, maybe even run some more heavyweight crypto algorithms. All this takes CPU. If you want to scale, taking advantage of multiple processors will allow you to scale even more, and/or keep the latency even lower per client.
Some of the gain in such a design can be a bit offset by the fact you might need more locking in a multithreaded environment, but done right, and certainly depening on what you're doing, it can be a huge win - at the expense of more complexity.
Also, this might help overcome OS limitations . The I/O paths in the kernel might get more distributed among the procesors. Not all operating systems might fully be able to thread the IO from a single threaded applications. Back in the old days there were'nt all the great alternatives to the old *nix select(), which usually had a filedesciptor limit of 1024, and similar APIs severly started degrading once you told it to monitor too many socket. Spreading all those clients on multiple threads or processes helped overcome that limit.
As for a 1:1 mapping between threads, there's several reasons to implement that architecture:
Easier programming model, which might lead to less hard to find bugs, and faster to implement.
Support blocking APIs. These are all over the place. Having a thread handle many/all of the clients and then go on to do a blocking call to a database is going to stall everyone. Even reading files can block your application, and you usually can't monitor regular file handles/descriptors for IO events - or when you can, the programming model is often exceptionally complicated.
The drawback here is it won't scale, atleast not with the most widely used languages/framework. Having thousands of native threads will hurt performance. Though some languages provides a much more lightweight approach here, such as Erlang and to some extent Go.

Resources