How can I reduce system call overhead when sending a large number of UDP packets? (both Windows & Linux)

How can I reduce system call overhead when sending a large number of UDP packets? (both Windows & Linux) - linux

For example, I am sending 100000 UDP packets on Windows. For each packet, I need to call WSASendTo() once, so probably a lot of system call overhead is introduced. Is there a way to do bulk sending and reduce this overhead? I could not find a solution for Windows after googling for a while. Also, I would like to know if this is possible on Linux. Thanks.

On Windows you can use the new Windows Registered I/O API (RIO) on Server 2012 and Windows 8 and later.
I've written quite a lot about it here and have done several performance comparisons with the previous APIs that are available on Windows. The performance tests can be found here.
In summary: "The Registered I/O Networking Extensions, RIO, is a new API that has been added to Winsock to support high-speed networking for increased networking performance with lower latency and jitter. These extensions are targeted primarily for server applications and use pre-registered data buffers and completion queues to increase performance. The increased performance comes from avoiding the need to lock memory pages and copy OVERLAPPED structures into kernel space when individual requests are issued, instead relying on pre-locked buffers, fixed sized completion queues, optional event notification on completions and the ability to return multiple completions from kernel space to user space in one go."
The results of my performance tests seem to imply that it works.

Use TransmitPackets() in Windows.

Microsoft has just recently adopted the QUIC protocol. To that end Microsoft had to improve the performance of UDP in Windows.
Therefore we now have a new "UDP send segmentation" and "UDP receive coalescing" API in Windows.
This solves the call overhead problem and takes advantage of hardware offloading on NICs which support it. This is the fastest possible method of transmitting UDP today. But with the caveat that it mostly requires all UDP packets in a call to have the same size.
Read more about it here:
https://techcommunity.microsoft.com/t5/networking-blog/making-msquic-blazing-fast/ba-p/2268963

Related

How to avoid DBus for Linux in embedded environment?

I am working in a Linux based embedded project with C/C++ and python applications. And we need an Inter Process Communication (IPC) method to transport JSON based messages between those applications. Initially DBus was an obvious option since it is present in almost all Linux distributions and is quite stable and proved software. Also there are libraries for many programming languages. Also DBus has a very granular and nice permission system - which is a requirement for our project (security reasons).
But unfortunately we have experienced some drawbacks of DBus:
We have hit some stability bugs like in some specific congestion situations there were some memory leaks which lead to dead IPC and only application restart helped.
Only the usage of DBus introduced 3-5 MB of ram usage per each application (which on a system with 512 MB RAM and multiplied by 25 applications does make some room for improvements).
The data flow model (signals / methods) seem to be a bit too complicated for the use-case we need.
Our next idea is to switch to some of Message broker available. But we also look for some nice to have features:
Be able to broadcast or Multicast messages to multiple applications
To have presence of applications when they connect/disconnect from the bus-server (the server can broadcast when new applications connect and when applications disconnect).
Watchdog of connected applications. Sometimes the apps might behave wrong on the IPC (by not answering to IPC messages) and the server with watchdog could detect that and disconnect that application and inform others that the application is dead.
How do we avoid DBus in this scenario?

differences between DPDK and Netfilter

I want to bypass the Linux network stack and transform raw packets to my custom codes in userland and handle them in there.
I know that you can make your custom drivers using pf-rings or DPDK and others. But I can not understand why should I make these kinds of drivers while I can use the Netfilter and hook my module to NF_IP_PRE_ROUTING state and send the packets to userland.
It would be a great help for me if anyone can explain me the main differences between them.

There is a huge difference between DPDK and Netfilter hooks. When using Netfilter / hooking NF_IP_PRE_ROUTING you hijack the packet flow and copy packets form kernel space to user space. This copy causes a large overhead.
When using DPDK you're actually mapping you network card's packet buffers to a userspace memory area. Meaning that instead of the kernel getting an interrupt from the NIC, then passing it through all its queues until it reaches NF_IP_PRE_ROUTING which in turn will copy the packer to userland upon request, DPDK offers you the possibility to access the mapped packet buffers directly from userspace, bypassing all meta-handling by the kernel, effectively improving performance (at the cost of code complexity and security).

There are a variety of techniques to grab raw packets and deliver them to a userspace application. The devil as usual in the details.
If all we need is to deliver packets to a userspace application -- there is no difference what solution to use. Libpcap, or tun/taps, or Netfilter, or pf-ring, or whatever. All will do just fine.
But if we need to process 100 million packets per second (~30 CPU cycles per packet on 3GHz) -- I don't think we have other options at the moment but DPDK. Google for "DPDK performance report" and have a look.
DPDK is a framework which works well on many platforms (x86, ARM, POWER etc) and supports many NICs. There is no need to write a driver, the support for the most popular NICs is already there.
There is also a support to manage CPU cores, huge pages, memory buffers, encryption, IP fragmentation etc etc. All designed to be able to forward 100 Mpps. If we need that performance...

Accessing wireless interface (802.11) at MAC layer (Linux)

I spent the last days reading through man pages, documentations and anything else google brought up, but I suppose I'm even more confused now than I was at the beginning.
Here is what I want to do: I want to send and receive data packets with my own layer 3-x protocol(s) via a wireless interface (802.11) on Linux systems with C/C++.
So far, so good. I do not require beacons, association or any AP/SSID related stuff. However, for data transmissions I'd like the MAC layer to behave "as usual", meaning unicast packets are ACK'd, retransmissions, backoff etc. I'd also like to enjoy the extended QoS capabilites (802.11e with 4 queues and different access categories). Promiscuous mode on the other hand is not a concern, I require only broadcast packets and packets sent to the specific station.
What would be the right way to go about it? Most of the documentation out there on raw socket access seems to be focused on network sniffing and that does not help. I've been playing around with the monitor mode for some time now, but from what I've read so far, received packets are not ACK'd in monitor mode etc.
Without monitor mode, what would be the alternative? Using ad hoc mode and unix raw sockets? Or do I have to fiddle around with the drivers?
I'm not looking for a complete solution, just some good ideas, where to start. I read through the man pages for socket(2), socket(7) and packet(7) but that did not help concerning the behaviour of the MAC layer in different modes.
Thanks in advance.

802.11 is layer 2 (and 1) protocol specification. It was designed in a way, which allows higher-layer protocols to treat it as Ethernet network. Addressing and behaviour is generally the same. So for a layer 3 protocol you should not be concerned about 802.11 at all and write your code as if you were expecting it to run on Ethernet network.
To make it work you should first connect to a wireless network of some sort (which is conceptually equal to plugging a wire into a Ethernet card). Here you may choose ad-hoc (aka IBSS) or infrastructural (aka BSS) network (or PBSS once 802.11ad is approved ;).
Operating cards without any sort of association with network (just spitting out packets on air) is not a good idea for a couple of reasons. Most importantly it's very hardware dependent and unreliable. You can still do it using RF mon (AKA monitor mode) interface on one side and packet injection (using radiotap header) on the other but I don't recommend that. Even if you have a set of identical cards you'll most likely encounter hard to explain and random behaviour at some point. 802.11 NICs are just not designed for this kind of operation and keep different mount of state inside firmware (read about FullMAC vs. SoftMAC cards). Even SoftMAC cards differ significantly. For example theoretically in monitor mode, as you said, card should not ACK received packets. There are cards though that will ACK received frame anyway, because they base their decision exclusively on the fact that said frame is addressed to them. Some cards may even try to ACK all frames they see. Similar thing will happen with retransmissions: some cards will send injected packet only once (that's how it should work). In other NICs, retransmissions are handled by hardware (and firmware) and driver cannot turn it off, so you will get automatic retransmission even with injected data.
Sticking with layer 3 and using existing modes (like ad hoc), will give you all capabilities you want and more (QoS etc.). Ethernet frame that you send to interface will be "translated" by the kernel to 802.11 format with QoS mapping etc.
If you want to find out about MAC behaviour in various modes you'll have to either read the mac80211 code or 802.11 standard itself. http://linuxwireless.org wiki my help you with a few things, but kernel hackers are usually to busy to write documentation other than comments in the code ;)
L3 protocol implementation itself can be also done either in kernel or user mode (using raw sockets). As usual kernel-side will be harder to do, but more powerful.

Because you want to create own network layer protocol (replacement for IP), the keyword is: "raw ethernet socket". So ignore "Raw IP socket" stuff.
This is where to start:
int sockfd = socket( PF_PACKET, SOCK_RAW, htons(XXX) );
Correct man page is: packet(7).
Find more information by googling with the keyword.
One quite complete example here.
Edit: The link to the example seems to be currently broken: another examples

Probably you want something like libpcap.
Libpcap allows you to read/inject raw packets from/into a network interface.

First, there’s something you should be aware of when trying to transmit raw 802.12 frames- the device driver must support packet injection.
You mentioned monitor mode, which is at a high level the rx equivalent of the injection capability- which is not a “mode”, jist a capability/feature. I say this because some 892.11 device drivers on Linux either:
Support monitor mode and frame injection
Support monitor mode and not frame injection
Support neither
I don’t know any straightforward way to check if the driver supports frame injection aside from attempting frame injection and sniffing the air on another device to confirm it was seen.
Monitor mode is usually easy to check by using sudo wlan0 set monitor and seeing what the return code and/or output is.
It’s been a few years since I’ve worked on this but at the time, very few devices supported monitor mode and frame injection “out of the box”. Many only supported monitor mode with a modified version of the vendor or kernel driver
You’ll want to make sure your device has a driver available that fully supports both. This sort of task (frame monitoring and injection) is common for Penetration Testers who tend to use Kali Linux, which is really just an Ubuntu distribution with a bunch of “hacking” tools and (modified) 802.11 device drivers preloaded and in its repositories. You can often save time finding a well supported card by using a search engine to find the device and driver recommended for Kali users
I’m bringing this monitor/injection capability up explicitly because when I first worked on a similar project a few years ago, I needed to use a patched version of the official kernel driver to support monitor mode- it was an rtl8812au chipset. At that time, I made an incorrect assumption that monitor mode support in the driver implied full injection support. I spent 2 days banging my head against the wall, convinced my frames weren’t built correctly in my application, causing no frames to leave the card. Turned out I needed a more recent branch of the driver I was using to get the full injection support. This driver in particular supports both monitor mode and frame injection now. The most frustrating thing about diagnosing that problem was that I did not receive any errors from system calls or in kernel messages while trying to transmit the frames- they were just being silently discarded somewhere, presumably in the driver
To your main question about how to do this- the answer is almost certainly libpcap if you’re writing your application in C/C++ as libpcap provides not only packet capture APIs but also packet injection APIs
If you do it in Python, scapy is an excellent option. The benefit of Python/scapy is that
Python code is much quicker to write than C
scapy provides a significant amount of classes that you can use to intuitively create a frame layer by layer
Because the layers are implemented as classes, you can also extend and “register” existing classes to make certain frames easier to create (or parse when received)
You can do this in straight C using the UNIX sockets API with raw sockets directly- but you’ll have to deal with things that libpcap exists to abstract from you- like underlying system calls that may be required when doing raw frame transmission, aside from the standard socket(), send(), recv() calls. I’m speculating that there are a handful of ioctl calls you may need at the least, specific to the kernel 802.11x subsystem/framework- and these ioctl() calls and their values may not be completely portable across different major kernel versions. I’ll admit I ended up not using the pure C (without libpcap) approach, so I’m not 100% sure about this potential problem. It’s something you should look more into if you plan to do it without libpcap. I don’t recommend it unless you have a really good reason to

It sounds like you are getting the media and transport layers mixed up.
802.11 is what's commonly referred to as a "link", "physical", or "media" layer, meaning it only deals with the transmission of raw datagrams.
Concepts like ACKs, retransmissions, backoff (flow control) apply to the "transport" layer, and those particular terms are strongly associated with TCP/IP.
Implementing your own complete transport layer from scratch is very difficult and almost certainly not what you want to do. If instead you want to use the existing TCP/IP stack on top of your own custom interpretation of 802.11, then you probably want to create a virtual network interface. This would act as an intermediary between TCP/IP and the media layer.
Hopefully this gives you some better context and keywords to look for.

How many socket connections possible?

Has anyone an idea how many tcp-socket connections are possible on a modern standard Linux server?
(There is in general less traffic on each connection, but all the connections have to be up all the time.)

I achieved 1600k concurrent idle socket connections, and at the same time 57k req/s on a Linux desktop (16G RAM, I7 2600 CPU). It's a single thread http server written in C with epoll. Source code is on github, a blog here.
Edit:
I did 600k concurrent HTTP connections (client & server) on both the same computer, with JAVA/Clojure . detail info post, HN discussion: http://news.ycombinator.com/item?id=5127251
The cost of a connection(with epoll):
application need some RAM per connection
TCP buffer 2 * 4k ~ 10k, or more
epoll need some memory for a file descriptor, from epoll(7)
Each registered file descriptor costs roughly 90
bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit kernel.

This depends not only on the operating system in question, but also on configuration, potentially real-time configuration.
For Linux:
cat /proc/sys/fs/file-max
will show the current maximum number of file descriptors total allowed to be opened simultaneously. Check out http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html

A limit on the number of open sockets is configurable in the /proc file system
cat /proc/sys/fs/file-max
Max for incoming connections in the OS defined by integer limits.
Linux itself allows billions of open sockets.
To use the sockets you need an application listening, e.g. a web server, and that will use a certain amount of RAM per socket.
RAM and CPU will introduce the real limits. (modern 2017, think millions not billions)
1 millions is possible, not easy. Expect to use X Gigabytes of RAM to manage 1 million sockets.
Outgoing TCP connections are limited by port numbers ~65000 per IP. You can have multiple IP addresses, but not unlimited IP addresses.
This is a limit in TCP not Linux.

10,000? 70,000? is that all :)
FreeBSD is probably the server you want, Here's a little blog post about tuning it to handle 100,000 connections, its has had some interesting features like zero-copy sockets for some time now, along with kqueue to act as a completion port mechanism.
Solaris can handle 100,000 connections back in the last century!. They say linux would be better
The best description I've come across is this presentation/paper on writing a scalable webserver. He's not afraid to say it like it is :)
Same for software: the cretins on the
application layer forced great
innovations on the OS layer. Because
Lotus Notes keeps one TCP connection
per client open, IBM contributed major
optimizations for the ”one process,
100.000 open connections” case to Linux
And the O(1) scheduler was originally
created to score well on some
irrelevant Java benchmark. The bottom
line is that this bloat beneﬁts all of
us.

On Linux you should be looking at using epoll for async I/O. It might also be worth fine-tuning socket-buffers to not waste too much kernel space per connection.
I would guess that you should be able to reach 100k connections on a reasonable machine.

depends on the application. if there is only a few packages from each client, 100K is very easy for linux. A engineer of my team had done a test years ago, the result shows : when there is no package from client after connection established, linux epoll can watch 400k fd for readablity at cpu usage level under 50%.

Which operating system?
For windows machines, if you're writing a server to scale well, and therefore using I/O Completion Ports and async I/O, then the main limitation is the amount of non-paged pool that you're using for each active connection. This translates directly into a limit based on the amount of memory that your machine has installed (non-paged pool is a finite, fixed size amount that is based on the total memory installed).
For connections that don't see much traffic you can reduce make them more efficient by posting 'zero byte reads' which don't use non-paged pool and don't affect the locked pages limit (another potentially limited resource that may prevent you having lots of socket connections open).
Apart from that, well, you will need to profile but I've managed to get more than 70,000 concurrent connections on a modestly specified (760MB memory) server; see here http://www.lenholgate.com/blog/2005/11/windows-tcpip-server-performance.html for more details.
Obviously if you're using a less efficient architecture such as 'thread per connection' or 'select' then you should expect to achieve less impressive figures; but, IMHO, there's simply no reason to select such architectures for windows socket servers.
Edit: see here http://blogs.technet.com/markrussinovich/archive/2009/03/26/3211216.aspx; the way that the amount of non-paged pool is calculated has changed in Vista and Server 2008 and there's now much more available.

Realistically for an application, more then 4000-5000 open sockets on a single machine becomes impractical. Just checking for activity on all the sockets and managing them starts to become a performance issue - especially in real-time environments.

How to implement web services on an embedded device?

We have an embedded device that needs to interact with an enterprise software system.
The enterprise system currently uses many different mechanisms for communication between its components: ODBC, RPC, proprietary protocol over TCP/IP, and is moving to .Net-implmented web services.
The embedded device runs a flavor of *nix, so we're looking at what the best interaction mechanism is.
The requirements for the communication are:
Must run over TCP/IP.
Must also run over RS-232 or USB.
Must be secure (e.g. HTTPS or SSL).
Must be capable of transferring ~32MB of data.
Our current best option is gSOAP.
Does anyone out there in SO-land have any other suggestions?
Edit: Steven's answer gave me the most new pointers. Thanks to all!

You can define RESTful services the use HTTPS (which uses TCP/IP by definition) and is capable of transferring any amount of data.
The advantage of REST over SOAP is that REST is simpler. It can use JSON instead of XML which is simpler.
It has less overhead than the SOAP protocol.

Can't you just use SSL over TCP?
If you have some kind of *nix (may I guess? It's either QNX or embedded linux, right?) it should work pretty much out of the box via Ethernet, USB and RS232. Keep thing simple.
32mb is plenty of memory for this task. I would allocate between 2 and 4 mb of memory for networking & encryption (code + data).

It's not real clear why you want to tie this to a remote-procedure-call protocol like SOAP. Are there other requirements you aren't mentioning?
In general, though, this sort of thing is handled very easily using normal web-based services. You can get very lightweight http processors written in C; see this Wikipedia article for comparisons of a number of them. Then a REST interface will work fine. There are network interfaces that treat USB as a TCP connection, as well.
If you must be able to run over RS232, you might want to look elsewhere; in that case, something like sftp might do better. Or write a simple application-layer protocol that you can run over an encrypted connection.

If you are going to connect your application using RS232, I assume that you will be using PPP to connect the device to the internet. The amount of data that you are proposing to transfer is somewhat worrisome, however. Most RS232 connections are limited to 115200 baud which, ignoring the overhead required for TCP/IP/PPP framing is going to yield a transfer rate of at most 11,000 bytes per second. This implies that it will take a minimum of approximately 2800 seconds or 46 minutes to make whatever transfer that you intend.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string