Multicasting + Linux Kernel

I have a doubt regarding multicasting in the Linux kernel. When multicast data arrives, the kernel checks the MFC (multicast forwarding cache); if no matching entry is found, the kernel sends a cache-miss control message and the packet header up to user space. My question is: what happens to the data packet itself? Suppose I deliberately do not want to keep the entry in the MFC, but I have some other table that holds the forwarding information and I want to use that instead -- what should I do?
Regards,
Bhavin.

If a data packet arrives for which there is no matching MFC entry, then the data packet gets put into a queue. It will stay in that queue until either an MFC entry gets added that matches that packet or a timeout expires (10 seconds), whichever happens first. The queue itself has a limit of 10 entries, and once that limit is reached no more packets will get put onto the queue. In that case, unresolved packets will get dropped.
I don't think Linux supports multiple MFC tables (but I could be wrong). As an alternative, you could route these multicast packets in user space by receiving them on a raw socket and then forwarding them out whatever interface you like. In fact, many IPv6 multicast routing daemons used a method like this before IPv6 multicast support on Linux matured.
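A rough sketch of that user-space approach in C, assuming IPv4; the group address (239.1.1.1), the egress interface address (192.0.2.1), and the forwarding decision are placeholders, error handling is omitted, and the raw sockets need CAP_NET_RAW:

    #include <arpa/inet.h>
    #include <linux/ip.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* Raw socket that receives UDP datagrams, IP header included. */
        int rx = socket(AF_INET, SOCK_RAW, IPPROTO_UDP);

        struct ip_mreq mreq = { 0 };
        mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");   /* placeholder group */
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(rx, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        /* Raw socket used to re-inject the packet; IPPROTO_RAW implies
         * IP_HDRINCL, so the received IP header is re-used as-is. */
        int tx = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
        struct in_addr out_if = { .s_addr = inet_addr("192.0.2.1") };  /* placeholder egress interface */
        setsockopt(tx, IPPROTO_IP, IP_MULTICAST_IF, &out_if, sizeof(out_if));

        unsigned char buf[2048];
        for (;;) {
            ssize_t n = recv(rx, buf, sizeof(buf), 0);
            if (n <= 0)
                continue;

            struct iphdr *ip = (struct iphdr *)buf;
            if (ip->ttl <= 1)
                continue;       /* this is where your own forwarding table would decide */
            ip->ttl--;
            ip->check = 0;      /* kernel fills in the IP checksum on IPPROTO_RAW sockets */

            struct sockaddr_in dst = { 0 };
            dst.sin_family = AF_INET;
            dst.sin_addr.s_addr = ip->daddr;
            sendto(tx, buf, n, 0, (struct sockaddr *)&dst, sizeof(dst));
        }
    }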

You can check whether your kernel was compiled with multicast support using a command like the one below (adjust the config filename to match your running kernel):
grep -i "multicast" /boot/config-2.6.32-358.6.1.el6.x86_64

Related

Query related to AF_XDP for transmission of data

I'm developing a user space application where my end goal is to send data from a Linux machine A (an embedded device) and receive on another Linux machine B (an embedded device) over AF_XDP. I want to use AF_XDP to achieve low latency.
I understand that for my user space application to receive data over AF_XDP I should use XDP_REDIRECT (please correct me if I'm wrong); what I can't understand is which XDP action (XDP_REDIRECT, XDP_TX, etc.) I should use to transmit data from my user space application over AF_XDP.
> what I can't understand is which XDP action (XDP_REDIRECT, XDP_TX, etc.) I should use to transmit data from my user space application over AF_XDP?
That is not how AF_XDP works. Once all of the setup work is done, by you or by a library you use, there is a shared memory region called the UMEM and four ring buffers which are used in a sort of dance to receive and transmit packets. The rings are called: UMEM_Fill, UMEM_Completion, RX, and TX.
On the ingress side, your XDP program is triggered so you can decide whether or not to send traffic to your AF_XDP socket. Here you use bpf_redirect_map to redirect the traffic into the socket/map, which returns XDP_REDIRECT if successful (it may fail if the socket isn't set up correctly or the buffers are full).
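For reference, a minimal XDP program in the style of the kernel's xdpsock sample; the XSKMAP name and size are assumptions, and your user-space code is expected to insert the AF_XDP socket for each RX queue into the map:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);          /* >= number of RX queues (placeholder) */
        __type(key, __u32);
        __type(value, __u32);
    } xsks_map SEC(".maps");

    SEC("xdp")
    int redirect_to_xsk(struct xdp_md *ctx)
    {
        __u32 qid = ctx->rx_queue_index;

        /* Redirect only if an AF_XDP socket is registered for this queue;
         * otherwise let the packet continue up the normal stack. */
        if (bpf_map_lookup_elem(&xsks_map, &qid))
            return bpf_redirect_map(&xsks_map, qid, XDP_PASS);
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";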
To read a redirected packet you dequeue the RX queue/ring; the frame descriptor within points to an address in UMEM, which is where your packet data is located. As soon as you are done with this memory you send the frame descriptor back over the UMEM_Fill queue/ring so the NIC/driver can reuse the frame.
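A sketch of that RX dance using the xsk.h helpers from libbpf/libxdp; it assumes the UMEM, RX ring, and Fill ring were already created with xsk_umem__create()/xsk_socket__create() (not shown), and process_packet() is a placeholder for your own handling:

    #include <bpf/xsk.h>          /* with newer libxdp this is <xdp/xsk.h> */
    #include <linux/if_xdp.h>

    extern void process_packet(void *pkt, __u32 len);   /* placeholder */

    /* One pass of the RX side. */
    void rx_once(void *umem_area, struct xsk_ring_cons *rx, struct xsk_ring_prod *fill)
    {
        __u32 idx_rx = 0, idx_fill = 0;
        __u32 rcvd = xsk_ring_cons__peek(rx, 64, &idx_rx);

        if (rcvd == 0)
            return;
        if (xsk_ring_prod__reserve(fill, rcvd, &idx_fill) != rcvd)
            return;                        /* Fill ring full; real code would retry */

        for (__u32 i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx_rx + i);
            void *pkt = xsk_umem__get_data(umem_area, desc->addr);

            process_packet(pkt, desc->len);

            /* Give the frame back to the kernel through the UMEM_Fill ring. */
            *xsk_ring_prod__fill_addr(fill, idx_fill + i) = desc->addr;
        }

        xsk_ring_cons__release(rx, rcvd);
        xsk_ring_prod__submit(fill, rcvd);
    }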
To write a packet you take a free frame descriptor (reclaiming completed ones from the UMEM_Completion queue/ring), fill the region of UMEM you have temporary control over with your packet, and give the frame descriptor to the TX queue/ring. The NIC/driver consumes the TX queue/ring and sends out the packet, after which the descriptor comes back to you on the UMEM_Completion queue/ring. No XDP program is triggered on egress.
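And a corresponding TX-side sketch under the same assumptions (the frame at frame_addr is assumed to have already been written into the UMEM):

    #include <bpf/xsk.h>          /* with newer libxdp this is <xdp/xsk.h> */
    #include <linux/if_xdp.h>
    #include <sys/socket.h>

    /* Send one frame; xsk, tx and comp come from the earlier setup. */
    void tx_one(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                struct xsk_ring_cons *comp, __u64 frame_addr, __u32 len)
    {
        __u32 idx = 0;

        if (xsk_ring_prod__reserve(tx, 1, &idx) != 1)
            return;                               /* TX ring full */

        struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);
        desc->addr = frame_addr;
        desc->len  = len;
        xsk_ring_prod__submit(tx, 1);

        /* Kick the kernel so the driver drains the TX ring (not needed in
         * every mode, but harmless). */
        sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);

        /* Reclaim frames the driver has finished sending from the
         * UMEM_Completion ring so their addresses can be reused. */
        __u32 idx_comp = 0;
        __u32 done = xsk_ring_cons__peek(comp, 64, &idx_comp);
        for (__u32 i = 0; i < done; i++) {
            __u64 freed = *xsk_ring_cons__comp_addr(comp, idx_comp + i);
            (void)freed;                          /* put back on your free-frame list */
        }
        xsk_ring_cons__release(comp, done);
    }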
I would highly recommend you check out https://www.kernel.org/doc/html/latest/networking/af_xdp.html, which has more details on using AF_XDP.

How do I increase TCP retransmission rate in Linux per-connection?

I'm using TCP over a very lossy network (almost a 20% drop rate) but one with extremely low latency (<2 ms), and the default TCP implementation on our Linux system is atrocious: sometimes it waits 5-6 seconds before retransmitting a packet. On the other side, our TCP stack simply retries every 20 ms, and it's fine.
I can't find any way to manually trigger a retransmission, even with TCP_NODELAY and sending no data aggressively. There also do not appear to be per-socket configurations for this, and we only want to change the timing for specific sockets (the ones on this network).
Is there any kernel feature for either manually retransmitting with TCP or setting the timers aggressively enough to allow many retries per second?
(I am aware this is a similar, but not the same, issue as the one in this thread: Application control of TCP retransmission on Linux -- but I do not want to close the connection, as TCP_USER_TIMEOUT does; I just want it to keep retrying, a lot.)
TCP_NODELAY is an entirely unrelated option: it disables Nagle's algorithm, which consolidates small outgoing packets. It is used on interactive connections such as ssh, where it is desirable to have every single character in a separate packet (to get an immediate remote echo).
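For completeness, this is how TCP_NODELAY is typically set; it affects batching of outgoing data, not the retransmission timer:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle's algorithm on an already-created TCP socket. */
    int disable_nagle(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }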
Normally, the TCP layer continuously measures the RTT (round-trip time) of acknowledged segments and readjusts its retransmission timeout accordingly (per Jacobson's algorithm; Karn's algorithm governs which samples may be used once retransmissions are involved).
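For reference, the standard retransmission timeout computation (RFC 6298), which Linux follows closely, is roughly:

    SRTT   <- (1 - 1/8) * SRTT   + 1/8 * R'             (smoothed RTT, R' = new sample)
    RTTVAR <- (1 - 1/4) * RTTVAR + 1/4 * |SRTT - R'|    (RTT variation)
    RTO    <- SRTT + max(G, 4 * RTTVAR)                 (G = clock granularity)

Linux additionally clamps the RTO to a minimum of about 200 ms and doubles it after every unanswered retransmission (exponential backoff), which is how a handful of consecutive losses can turn into multi-second waits.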
AFAIK, there are no per-socket tunables in the Linux kernel for the retransmission timer itself, because its behavior is specified by the RFCs and is embedded in the protocol implementation.
I suggest you read this introduction: https://www.catchpoint.com/blog/tcp-rtt and then capture a packet dump containing those exceptionally long retransmissions and post it here.
Also, are you sure that the packet loss is the same in both directions? Asymmetric loss could be an explanation.

Can I use SO_REUSEPORT to distribute a single UDP flow to multiple receiver threads?

My Linux application needs to receive a single UDP flow with modestly-sized packets (~1 KB) at a rate on the order of ~600,000 packets per second. My current implementation is naive: it has a single thread that simply calls recv() repeatedly, placing the received data in a queue to be processed by another thread. Therefore, the receiver thread is only in charge of pulling in the packets.
In some initial testing that I've done, I'm only able to receive between 200,000-300,000 packets per second before the thread reaches full utilization of its CPU core. This obviously isn't good enough to meet the goal of ~600,000 packets per second.
Ideally, I would find some way of distributing the packet reception load across multiple threads. In looking for a solution to the problem, I came across the SO_REUSEPORT socket option, which allows multiple TCP/UDP sockets to be bound to the same IP/port combination. At first, this seemed to be exactly what I wanted.
However, the article describing SO_REUSEPORT also points out this detail:
> Incoming connections and datagrams are distributed to the server sockets using a hash based on the 4-tuple of the connection—that is, the peer IP address and port plus the local IP address and port. This means, for example, that if a client uses the same socket to send a series of datagrams to the server port, then those datagrams will all be directed to the same receiving server (as long as it continues to exist). This eases the task of conducting stateful conversations between the client and server.
Therefore, if I only have a single UDP flow, the above hashing implementation would yield all of the packets being directed to the same receiver thread, thwarting my attempt at parallelizing the work. Therefore, the question is: is there a way to receive a single flow of UDP packets from multiple threads, using SO_REUSEPORT or some other mechanism?
Note that my application can handle reordering of packets; the protocol that the datagrams are formatted with contains sequencing information that I can use to reorder them properly afterward.
If you haven't found a solution in the last 3 years, take a look at SO_ATTACH_REUSEPORT_CBPF. We had exactly the same issue and solved it by attaching a simple BPF program that distributes datagrams randomly, modulo n.
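A sketch of that approach: n UDP sockets bound to the same port with SO_REUSEPORT, plus a classic-BPF program attached to the group that picks a socket index from the kernel's pseudo-random number, modulo n. The port, socket count, and thread handling are placeholders, and SO_ATTACH_REUSEPORT_CBPF needs a reasonably recent kernel (4.5+) and matching libc headers:

    #include <linux/filter.h>     /* struct sock_filter, SKF_AD_* */
    #include <netinet/in.h>
    #include <sys/socket.h>

    #define NUM_SOCKETS 4         /* placeholder: number of receiver threads */
    #define PORT        5555      /* placeholder */

    static int make_reuseport_socket(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(PORT);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        return fd;
    }

    int main(void)
    {
        int fds[NUM_SOCKETS];
        for (int i = 0; i < NUM_SOCKETS; i++)
            fds[i] = make_reuseport_socket();   /* one per receiver thread */

        /* cBPF: A = prandom; A %= NUM_SOCKETS; return A.
         * The return value selects which socket in the group gets the datagram. */
        struct sock_filter code[] = {
            { BPF_LD  | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_RANDOM },
            { BPF_ALU | BPF_MOD | BPF_K, 0, 0, NUM_SOCKETS },
            { BPF_RET | BPF_A,           0, 0, 0 },
        };
        struct sock_fprog prog = { .len = 3, .filter = code };

        /* Attaching to any one socket applies to the whole reuseport group. */
        setsockopt(fds[0], SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &prog, sizeof(prog));

        /* ... hand each fds[i] to its own thread calling recv() ... */
        return 0;
    }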

Low latency serving same data to many clients (multicasting or not...)

I need to send identical information to hundreds of clients over the Internet. I currently maintain a list of client connections and iterate over it. Obviously, the longer the list gets, the more latency there is toward the end of the list.
I have looked at multicasting. However, unless I am missing something, it is only good for LAN-based communication at present: it requires routers that support multicasting, and most routers do not. There is also no mechanism that I can see for requesting an available multicast address, to avoid broadcasting to an address already in use.
So my questions are:
1) Am I missing something, and can I use multicasting to accomplish this? (I have tried without success.)
2) Other than multicasting, is there a shortcut to sending identical packets to many recipients?
I solved the problem by multicasting between threads in the server. Every client connection results in the creation of an object; these objects are stored in a queue. Each object has its own thread and joins the multicast group. When the server multicasts a string to the client objects, the delay that arose from iterating over the list no longer occurs.
Every now and then there is a huge latency spike (nearly a second). I suspect that this is a JVM thing.
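For illustration only (the poster's code was Java, so this is just a sketch of the pattern in C): each per-connection worker joins a multicast group, and the publisher sends once to that group with loopback enabled so all local listeners see it. The group address and port are placeholders:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define GROUP "239.255.0.1"   /* placeholder group */
    #define PORT  6000            /* placeholder port */

    /* Publisher: one sendto() reaches every local listener that joined the group. */
    void publish(const char *msg)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int loop = 1;
        setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &loop, sizeof(loop));

        struct sockaddr_in dst = { 0 };
        dst.sin_family = AF_INET;
        dst.sin_addr.s_addr = inet_addr(GROUP);
        dst.sin_port = htons(PORT);
        sendto(fd, msg, strlen(msg), 0, (struct sockaddr *)&dst, sizeof(dst));
        close(fd);
    }

    /* Each per-connection worker runs something like this to receive. */
    int join_group(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        struct sockaddr_in local = { 0 };
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(PORT);
        bind(fd, (struct sockaddr *)&local, sizeof(local));

        struct ip_mreq mreq = { 0 };
        mreq.imr_multiaddr.s_addr = inet_addr(GROUP);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
        return fd;                 /* recv() on this fd in the worker thread */
    }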
If you need high-performance, low-latency IO, you should try http://nodejs.org/
You may also be interested in a cache such as http://memcached.org/

Linux multicast sendto() performance degrades with local listeners

We have a "publisher" application that sends out data using multicast. The application is extremely performance sensitive (we are optimizing at the microsecond level). Applications that listen to this published data can be (and often are) on the same machine as the publishing application.
We recently noticed an interesting phenomenon: the time to do a sendto() increases proportionally to the number of listeners on the machine.
For example, let's say with no listeners the base time for our sendto() call is 5 microseconds. Each additional listener increases the time of the sendto() call by about 2 microseconds. So if we have 10 listeners, now the sendto() call takes 2*10+5 = 25 microseconds.
This to me suggests that the sendto() call blocks until the data has been copied to every single listener.
Analysis of the listening side supports this as well. If there are 10 listeners, each listener receives the data two microseconds later than the previous. (I.e., the first listener gets the data in about five microseconds, and the last listener gets the data in about 23--25 microseconds.)
Is there any way, either at the programmatic level or the system level, to change this behavior? Something like a non-blocking/asynchronous sendto() call, or one that at least blocks only until the message is copied into the kernel's memory, so it can return without waiting on all the listeners?
Multicast loopback is incredibly inefficient and shouldn't be used for high-performance messaging. As you noted, for every send the kernel copies the message to every local listener.
The recommended approach is to use a separate IPC method to distribute to other threads and processes on the same host, either shared memory or Unix domain sockets.
For example, this can easily be implemented with ZeroMQ by adding an ipc:// connection alongside a PGM multicast connection on the same ZeroMQ socket.
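A sketch of that with the libzmq C API; the endpoints are placeholders, and the epgm:// transport requires libzmq built with OpenPGM support:

    #include <string.h>
    #include <zmq.h>

    int main(void)
    {
        void *ctx = zmq_ctx_new();
        void *pub = zmq_socket(ctx, ZMQ_PUB);

        /* Same PUB socket bound to two transports: local subscribers connect
         * over ipc://, remote ones over PGM multicast (placeholder endpoints). */
        zmq_bind(pub, "ipc:///tmp/feed.ipc");
        zmq_bind(pub, "epgm://eth0;239.192.1.1:5555");

        const char *msg = "tick";
        zmq_send(pub, msg, strlen(msg), 0);   /* delivered via both transports */

        zmq_close(pub);
        zmq_ctx_term(ctx);
        return 0;
    }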
Sorry for asking the obvious, but is the socket non-blocking? (Add O_NONBLOCK to the descriptor's flags -- see fcntl.)
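That is, something along these lines:

    #include <fcntl.h>

    /* Put an existing socket descriptor into non-blocking mode. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags == -1)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }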
