How many socket connections possible? - linux

Has anyone an idea how many tcp-socket connections are possible on a modern standard Linux server?
(There is in general less traffic on each connection, but all the connections have to be up all the time.)

I achieved 1600k concurrent idle socket connections, and at the same time 57k req/s on a Linux desktop (16G RAM, I7 2600 CPU). It's a single thread http server written in C with epoll. Source code is on github, a blog here.
Edit:
I did 600k concurrent HTTP connections (client & server) on both the same computer, with JAVA/Clojure . detail info post, HN discussion: http://news.ycombinator.com/item?id=5127251
The cost of a connection(with epoll):
application need some RAM per connection
TCP buffer 2 * 4k ~ 10k, or more
epoll need some memory for a file descriptor, from epoll(7)
Each registered file descriptor costs roughly 90
bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit kernel.

This depends not only on the operating system in question, but also on configuration, potentially real-time configuration.
For Linux:
cat /proc/sys/fs/file-max
will show the current maximum number of file descriptors total allowed to be opened simultaneously. Check out http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html

A limit on the number of open sockets is configurable in the /proc file system
cat /proc/sys/fs/file-max
Max for incoming connections in the OS defined by integer limits.
Linux itself allows billions of open sockets.
To use the sockets you need an application listening, e.g. a web server, and that will use a certain amount of RAM per socket.
RAM and CPU will introduce the real limits. (modern 2017, think millions not billions)
1 millions is possible, not easy. Expect to use X Gigabytes of RAM to manage 1 million sockets.
Outgoing TCP connections are limited by port numbers ~65000 per IP. You can have multiple IP addresses, but not unlimited IP addresses.
This is a limit in TCP not Linux.

10,000? 70,000? is that all :)
FreeBSD is probably the server you want, Here's a little blog post about tuning it to handle 100,000 connections, its has had some interesting features like zero-copy sockets for some time now, along with kqueue to act as a completion port mechanism.
Solaris can handle 100,000 connections back in the last century!. They say linux would be better
The best description I've come across is this presentation/paper on writing a scalable webserver. He's not afraid to say it like it is :)
Same for software: the cretins on the
application layer forced great
innovations on the OS layer. Because
Lotus Notes keeps one TCP connection
per client open, IBM contributed major
optimizations for the ”one process,
100.000 open connections” case to Linux
And the O(1) scheduler was originally
created to score well on some
irrelevant Java benchmark. The bottom
line is that this bloat benefits all of
us.

On Linux you should be looking at using epoll for async I/O. It might also be worth fine-tuning socket-buffers to not waste too much kernel space per connection.
I would guess that you should be able to reach 100k connections on a reasonable machine.

depends on the application. if there is only a few packages from each client, 100K is very easy for linux. A engineer of my team had done a test years ago, the result shows : when there is no package from client after connection established, linux epoll can watch 400k fd for readablity at cpu usage level under 50%.

Which operating system?
For windows machines, if you're writing a server to scale well, and therefore using I/O Completion Ports and async I/O, then the main limitation is the amount of non-paged pool that you're using for each active connection. This translates directly into a limit based on the amount of memory that your machine has installed (non-paged pool is a finite, fixed size amount that is based on the total memory installed).
For connections that don't see much traffic you can reduce make them more efficient by posting 'zero byte reads' which don't use non-paged pool and don't affect the locked pages limit (another potentially limited resource that may prevent you having lots of socket connections open).
Apart from that, well, you will need to profile but I've managed to get more than 70,000 concurrent connections on a modestly specified (760MB memory) server; see here http://www.lenholgate.com/blog/2005/11/windows-tcpip-server-performance.html for more details.
Obviously if you're using a less efficient architecture such as 'thread per connection' or 'select' then you should expect to achieve less impressive figures; but, IMHO, there's simply no reason to select such architectures for windows socket servers.
Edit: see here http://blogs.technet.com/markrussinovich/archive/2009/03/26/3211216.aspx; the way that the amount of non-paged pool is calculated has changed in Vista and Server 2008 and there's now much more available.

Realistically for an application, more then 4000-5000 open sockets on a single machine becomes impractical. Just checking for activity on all the sockets and managing them starts to become a performance issue - especially in real-time environments.

Related

Is it more resource-efficient to run two threads under low load or a single thread under moderate load?

Consider a client application that is going to receive UDP packets from the same IP address, but on different ports. The bit rates of both data streams are much lower than the overall network connection throughput (say, both are around 2 Kb/s). The machine running the client is going to have a modern ARMv7 or x86-64 CPU.
Which of the following two approaches is better in terms of efficient use of the target machine’s resources?
Run single-threaded, blocking on both sockets with the epoll system call, and read from one or both sockets sequentially.
Run in two threads, each dedicated to one of the two sockets, and use simple blocking I/O (not invoking epoll).
Is there a possibility of losing packets by following the first approach when both sockets have data? Does the answer change with different number of CPU cores available on the target machine?

Need to increase the number of concurrent HTTP connections to 85000

I have a setup with 2 machines. I am using one as the server and the other as client. They are connected directly using a 1Ghz link. Both the machines have 4 cores, 8Gb ram and almost 100Gb disk space. I need to tune the Nginx server ( its the one im trying with but i can use any other as well) to handle 85000 concurrent connections. I have a 1kb file on the server and i am using curl on the client to get the same file over all the connections.
After trying various tuning settings, i have 1500 established connections and around 30000 TIME_WAIT connections when i call the curl around 40000 times. Is there a way i can make the TIME_WAITs ESTABLISHED?
Any help in tuning both the server and client will be much appreciated. I am pretty new to using Linux and trying to get the hang of it. The version of linux on both machines is Fedora 20.
Besides of tuning Nginx, you will also need to tune your Linux installation in respect to limits in number of tcp connections, sockets, open files, etc.
These two links should give you a great overview:
https://www.nginx.com/blog/tuning-nginx/
https://mrotaru.wordpress.com/2013/10/10/scaling-to-12-million-concurrent-connections-how-migratorydata-did-it/
You might want to check how much memory TCP buffers etc are using for all those connections.
See this SO thread: How much memory is consumed by the Linux kernel per TCP/IP network connection?
Also, this page is good: http://www.psc.edu/index.php/networking/641-tcp-tune
Given that your two machines are one the same physical network and delays are very low, you can use fairly small TCP window buffer sizes. Modern Linuxes (you didn't mention what kernel you're using) have TCP Autotuning that automatically adjusts these buffers, so you should not have to worry about this unless you're using an old kernel.
Regardless, however, the application(s) can allocate send- and receive buffers separately, which disables TCP Autotuning, so if you're running an application that does this, you might want to limit how much buffer space an application can request per connection (the net.core.wmem_max and net.core.rmem_max variables mentioned in the SO article).
I would recommend https://github.com/eunyoung14/mtcp to achieve 1 million concurrent connection, I did some tuning of mtcp and tested it on a used Dell PowerEdge R210 with 32G ram and 8 cores to achieve 1 million concurrent connection.

The Windows desktop becomes paralysed during heavy network I/O / Windows kernel allocates only 1 out of many CPUs?

Problem: We implement a video recording system on a Windows Server 2012 system. In spite of low CPU and memory consumption, we face serious performance problems.
Short program description: the application (VS2005/C++) creates many network sockets, each receiving a multicast UDP video stream from an Ethernet network. Per stream the application provides a receiver buffer by calling WSARecvFrom() (overlapped operation), waits in MsgWaitForMultipleObjects() for the Window's "data arrived" event, takes the data packet, and repeats all again in an endless loop. For testing, to assure minimal CPU and memory consumption beside the pure socket IO work, the application does nothing, neither any disk/file IO. The application process is configured to use all available cores on the machine (default affinity settings unchanged).
Tests run: the test is run on two different machines: a) a Windows 7 with 4 physical cores / 8 with hyper-threading, and b) a Windows Server 2012 with 12 physical cores / 24 with hyper-threading.
Both systems show the same problem: everything works fine up to a certain number of configured sockets / network streams. Increasing them further (and we need to) finally paralyses the Windows desktop (mouse-pointer, repainting). At this stage the total CPU load is still very low (i.e. 10-15%) and there is much free memory available. But the Task-Manager shows extremely one-sided CPU loads: CPU 0 nearly 100%, all other CPUs near to 0%. Changing the Processor Affinity for the process in the Task Manager doesn't help.
Question 1: it looks like CPU 0 is doing the whole kernel's network IO work. Is that likely ?
Question 2: if yes, is there a way to control the kernel's use of available CPUs? If yes, how ?
Question 3: if no, is there any other way to make Windows distribute the (kernel) network IO work to other CPUs (i.e. by installing multiple NIC Cards, each NIC receiving only a subset of the network streams, and bind each NIC to another CPU) ?
Most thankful for any hints from anybody out there.
I'm not a Windows server guy, but this sounds like an interrupt issue. This often happens in high throughput systems, especially real-time ones.
Background:
Simply speaking, for each packet your network interface generates an interrupt, informing the CPU that it needs to handle the newly arrived data. High throughput network cards (e.g. 10Gbps) that receive small packets can easily overwhelm the CPU with these interrupts.
Just to get a feel for the problem, let's do some math - if you saturate a 10G line with 100 byte packets, that means that (ideally) 12,500,000 packets are sent over the line each second. In reality, It's less due to overhead; say 10,000,000 packets per second (pps). Your 3Ghz cpu generates 3,000,000,000 clocks per second. So it needs to handle a packet a packet every 300 clock cycles. That's pretty hard for a general purpose machine.
Now, I don't know the rate of packet arrival in your case, nor do I know your average packet length. But based on the symptoms you described, you might have run into this issue.
Solutions
Offload work to your card
Modern day network cards, especially high throughput ones, support all kinds of useful offloads such as GRO, TOE, and others. These take some network related work off the CPU (such as checksum calculation, packet fragmentation etc) and put it onto the network card which carries dedicated hardware for performing it. Check out the offloads supported by your card. In Linux, managing offloading is performed using an application called ethtool. Since I never played with offloading in windows, I can only point in the direction of the most relevant windows article I found, but I can't offer any experience-based advice.
Use interrupt throttling.
Interrupt throttling is another ability of (some) network cards and their drivers which allows them to limit the number of interrupts your CPU receives, essentially interrupting the core once every few packets instead of once per packet.
Use a multi-queue network card, and set interrupt affinities.
Some network cards have multiple (packet) queues, and therefore multiple interrupt lines, one per queue. They split incoming traffic evenly between queues using a hash function, creating (usually) 8 or 16 flows at 1/8 or at 1/16 of the line rate. Each flow can be tied to a specific CPU core using interrupt affinity, and since the hash function is calculated on IPs and port numbers, and is deterministic, each TCP/IP level session will always be handled by the same core. In Linux, setting the affinity requires writing to /proc/irq/<interrupt number>/smp_affinity. In windows, this seems to be the way.

How can I reduce system call overhead when sending a large number of UDP packets? (both Windows & Linux)

For example, I am sending 100000 UDP packets on Windows. For each packet, I need to call WSASendTo() once, so probably a lot of system call overhead is introduced. Is there a way to do bulk sending and reduce this overhead? I could not find a solution for Windows after googling for a while. Also, I would like to know if this is possible on Linux. Thanks.
On Windows you can use the new Windows Registered I/O API (RIO) on Server 2012 and Windows 8 and later.
I've written quite a lot about it here and have done several performance comparisons with the previous APIs that are available on Windows. The performance tests can be found here.
In summary: "The Registered I/O Networking Extensions, RIO, is a new API that has been added to Winsock to support high-speed networking for increased networking performance with lower latency and jitter. These extensions are targeted primarily for server applications and use pre-registered data buffers and completion queues to increase performance. The increased performance comes from avoiding the need to lock memory pages and copy OVERLAPPED structures into kernel space when individual requests are issued, instead relying on pre-locked buffers, fixed sized completion queues, optional event notification on completions and the ability to return multiple completions from kernel space to user space in one go."
The results of my performance tests seem to imply that it works.
Use TransmitPackets() in Windows.
Microsoft has just recently adopted the QUIC protocol. To that end Microsoft had to improve the performance of UDP in Windows.
Therefore we now have a new "UDP send segmentation" and "UDP receive coalescing" API in Windows.
This solves the call overhead problem and takes advantage of hardware offloading on NICs which support it. This is the fastest possible method of transmitting UDP today. But with the caveat that it mostly requires all UDP packets in a call to have the same size.
Read more about it here:
https://techcommunity.microsoft.com/t5/networking-blog/making-msquic-blazing-fast/ba-p/2268963

What are good tools to take IO measurements and discover bottlenecks on linux?

I'm trying to do some tuning for Oracle on Linux boxes living on SAN based infrastructure. I'm looking specifically for tools that would allow us to profile IO per process (or per process tree would be even better). My questions are?
What are the tools that would be recommended for this kind of task?
What other useful metrics should I seek to measure on a SAN based infrastructure?
I have used "iotop" with great results. It gets specific info per process with IO usage.
It works like "top"
http://guichaz.free.fr/iotop/
I am not sure though if it would be reasonable to use from a Linux box that has the SAN mounted or if you wanted a tool that could run within the SAN.
Once you start to get this specialized, I've found that the easiest thing to do is to write some custom scripts that pull information from files under /proc.
If you're analysis for which you don't already have a tool that gives you the exact report you need, you're probably going to end up doing some scripting anyway, and most of the tools you'd use under Linux are just going to /proc to get their information anyway and then reformatting it for you.
If you're more into the databasey side of things, pulling info from /proc on a regular basis, adding timestamps, and recording it in a way that it can be imported into an RDBMS can be very useful. This can be particularly good if you put all of your server and process performance information into a single RDBMS, because then you can compare, arbitrary things such as the performance of the same application on different servers.
Keep in mind that if you go further with this, you my start adding information from different sources, such as IPMI monitoring of hosts, so don't do things that you'll have to undo once you're using more than /proc.
You can use sysstat utilities which are a collection of performance monitoring tools for
Linux.
From the website (perso.orange.fr/sebastien.godard/)
* Can monitor a huge number of different metrics:
1. Input / Output and transfer rate statistics (global, per device, per partition, per network filesystem and per Linux task / PID)
2. CPU statistics (global, per CPU and per Linux task / PID), including support for virtualization architectures
3. Memory and swap space utilization statistics
4. Virtual memory, paging and fault statistics
5. Per-task (per-PID) memory and page fault statistics
6. Global CPU and page fault statistics for tasks and all their children
7. Process creation activity
8. Interrupt statistics (global, per CPU and per interrupt, including potential APIC interrupt sources)
9. Extensive network statistics: network interface activity (number of packets and kB received and transmitted per second, etc.) including failures from network devices; network traffic statistics for IP, TCP, ICMP and UDP protocols based on SNMPv2 standards; support for IPv6-related protocols.
10. NFS server and client activity
11. Socket statistics
12. Run queue and system load statistics
13. Kernel internal tables utilization statistics
14. System and per Linux task switching activity
15. Swapping statistics
16. TTY device activity
17. Power management statistics
I usually use atop to monitor the load on my systems. Some features require that you patch the kernel, but it gives precise information about I/O as well as other info.
What other useful metrics should I seek to measure on a SAN based infrastructure?
CPU load. It's Main metrics for oracle database.
Depending on how low-level you want to get, System Tap could be very useful for you. It is similar to DTrace on Solaris.

Resources