Change congestion control algorithms per connection - linux

The 'sysctl' command on Linux currently changes the congestion control algorithm globally, for the entire system. But congestion control, where the TCP window size and other similar parameters are varied, is normally done per TCP connection. So my question is:
Does there exist a way where I can change the congestion control algorithm being used per TCP connection?
Or am I missing something trivial here? If so, what is it?

This is done in iperf using the -Z option - the patch is here.
This is how it is implemented (PerfSocket.cpp, line 93):
if ( isCongestionControl( inSettings ) ) {
#ifdef TCP_CONGESTION
    Socklen_t len = strlen( inSettings->mCongestion ) + 1;
    int rc = setsockopt( inSettings->mSock, IPPROTO_TCP, TCP_CONGESTION,
                         inSettings->mCongestion, len);
    if (rc == SOCKET_ERROR ) {
        fprintf(stderr, "Attempt to set '%s' congestion control failed: %s\n",
                inSettings->mCongestion, strerror(errno));
        exit(1);
    }
#else
    fprintf( stderr, "The -Z option is not available on this operating system\n");
#endif
}
Where mCongestion is a string containing the name of the algorithm to use

It seems this is possible via get/setsockopt. The only documentation I found is:
http://lkml.indiana.edu/hypermail/linux/net/0811.2/00020.html

In newer versions of Linux it is possible to set the congestion control algorithm for a specific destination using ip route ... congctl.
If anyone is familiar with this approach, please edit this post.

As far as I know, there's no way to change the default TCP congestion control per process (I'd love a bash script to be able to say that whatever it executes should default to the lp congestion control).
The only user mode API I'm aware of is as follows:
setsockopt(socket, SOL_TCP, TCP_CONGESTION, congestion_alg, strlen(congestion_alg));
where socket is an open socket, and congestion_alg is a string containing one of the words in /proc/sys/net/ipv4/tcp_allowed_congestion_control.
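For completeness, here is a minimal, self-contained sketch of that call (my own example, not from the iperf patch): it uses IPPROTO_TCP, which on Linux is the same level value as SOL_TCP, and "reno" stands in for any name listed in tcp_allowed_congestion_control.
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_CONGESTION (Linux-specific) */

int main(void)
{
    const char *alg = "reno";   /* any entry from tcp_allowed_congestion_control */
    char cur[16] = { 0 };
    socklen_t len = sizeof(cur);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Change the congestion control algorithm for this one socket only. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, alg, strlen(alg)) < 0)
        perror("setsockopt(TCP_CONGESTION)");

    /* Read back the algorithm actually in effect on the socket. */
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cur, &len) == 0)
        printf("congestion control on this socket: %s\n", cur);

    return 0;
}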

Linux has pluggable congestion control algorithms that can be changed on the fly, but this is a system-wide setting, not a per-connection one.

Related

Where does Wireshark/tcpdump/libpcap intercept packets inside the Linux kernel?

According to this, Wireshark is able to get the packet before it is dropped (therefore I cannot get such packets by myself). And I'm still wondering about the exact location in the Linux kernel where Wireshark fetches the packets.
The answer goes: "On UN*Xes, it uses libpcap, which, on Linux, uses AF_PACKET sockets." Does anyone have a more concrete example of using "AF_PACKET sockets"? If I understand Wireshark correctly, the network interface card (NIC) will make a copy of all incoming packets and send it to a filter (Berkeley Packet Filter) defined by the user. But where does this happen? Or is my understanding wrong, and am I missing something here?
Thanks in advance!
But where does this happen?
If I understood you correctly, you want to know where such a socket is initialized.
There is a pcap_create function that tries to determine the type of the source interface, creates a duplicate of it, and activates it.
For network interfaces, see the pcap_create_interface function => pcap_create_common function => pcap_activate_linux function.
All the initialization happens in pcap_activate_linux => activate_new function => iface_bind function
(it copies the device name with handlep->device = strdup(device);,
creates the socket with socket(PF_PACKET, SOCK_DGRAM, htons(ETH_P_ALL)),
and binds the socket to the device with bind(fd, (struct sockaddr *) &sll, sizeof(sll))).
For more detailed information, read the comments in the source files of the mentioned functions; they are very detailed.
After initialization, all the work happens in a group of functions such as pcap_read_linux, etc.
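For a more concrete example of "AF_PACKET sockets", here is a rough user-space sketch of what libpcap sets up internally (my own illustration, not libpcap code; it needs root or CAP_NET_RAW, and "eth0" is just an example interface name):
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <net/if.h>           /* if_nametoindex */

int main(void)
{
    /* Raw packet socket that sees every protocol on every interface. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket(AF_PACKET)"); return 1; }

    /* Optionally bind to a single interface, much like iface_bind() does. */
    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex("eth0");
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
        perror("bind");

    /* Each recvfrom() returns one link-layer frame copied by the kernel. */
    unsigned char frame[65536];
    ssize_t n = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
    if (n >= 0)
        printf("captured a frame of %zd bytes\n", n);

    return 0;
}
The kernel hands each bound packet socket its own copy of the frame; libpcap additionally attaches the compiled BPF filter to this socket with SO_ATTACH_FILTER so unwanted frames are discarded in the kernel.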
On Linux, you should be able to simply use tcpdump (which leverages the libpcap library) to do this. Output can go to a file or to stdout, and you specify the filter at the end of the tcpdump command.

force socket disconnect without forging RST, Linux

I have a network client which is stuck in recvfrom on a connection to a server not under my control, which, after 24+ hours, is probably never going to respond. The program has processed a great deal of data, so I don't want to kill it; I want it to abandon the current connection and proceed. (It will do so correctly if recvfrom returns EOF or -1.) I have already tried several different programs that purport to be able to disconnect stale TCP channels by forging RSTs (tcpkill, cutter, killcx); none had any effect, and the program remained stuck in recvfrom. I have also tried taking the network interface down; again, no effect.
It seems to me that there really should be a way to force a disconnect at the socket-API level without forging network packets. I do not mind horrible hacks, up to and including poking kernel data structures by hand; this is a disaster-recovery situation. Any suggestions?
(For clarity, the TCP channel at issue here is in ESTABLISHED state according to lsof.)
I do not mind horrible hacks
That's all you have to say. I am guessing the tools you tried didn't work because they sniff traffic to get an acceptable ACK number to kill the connection. Without traffic flowing they have no way to get hold of it.
Here are things you can try:
Probe all the sequence numbers
Where those tools failed, you can still do it yourself. Write a simple Python script with Scapy and, for each candidate sequence number, send a RST segment with the correct 4-tuple (ports and addresses). There are at most 4 billion to try (actually far fewer, assuming a decent window; you can find out the window for free using ss -i).
Make a kernel module to get hold of the socket
Make a kernel module that walks the list of TCP sockets: look for sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[i].chain)
Identify your victim sk
At this point you have intimate access to your socket. So:
You can call tcp_reset or tcp_disconnect on it. You won't be able to call tcp_reset directly (since it doesn't have EXPORT_SYMBOL), but you should be able to mimic it: most of the functions it calls are exported.
Or you can get the expected ACK number from tcp_sk(sk) and directly forge a RST packet with Scapy.
Here is a function I use to print established sockets; I scrounged bits and pieces from the kernel to put it together some time ago:
#include <net/inet_hashtables.h>

#define NIPQUAD(addr) \
    ((unsigned char *)&addr)[0], \
    ((unsigned char *)&addr)[1], \
    ((unsigned char *)&addr)[2], \
    ((unsigned char *)&addr)[3]
#define NIPQUAD_FMT "%u.%u.%u.%u"

extern struct inet_hashinfo tcp_hashinfo;

/* Decides whether a bucket has any sockets in it. */
static inline bool empty_bucket(int i)
{
    return hlist_nulls_empty(&tcp_hashinfo.ehash[i].chain);
}

void print_tcp_socks(void)
{
    int i = 0;
    struct inet_sock *inet;

    /* Walk hash array and lock each if not empty. */
    printk("Established ---\n");
    for (i = 0; i <= tcp_hashinfo.ehash_mask; i++) {
        struct sock *sk;
        struct hlist_nulls_node *node;
        spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, i);

        /* Lockless fast path for the common case of empty buckets */
        if (empty_bucket(i))
            continue;

        spin_lock_bh(lock);
        sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[i].chain) {
            if (sk->sk_family != PF_INET)
                continue;
            inet = inet_sk(sk);
            printk(NIPQUAD_FMT ":%hu ---> " NIPQUAD_FMT ":%hu\n",
                   NIPQUAD(inet->inet_saddr), ntohs(inet->inet_sport),
                   NIPQUAD(inet->inet_daddr), ntohs(inet->inet_dport));
        }
        spin_unlock_bh(lock);
    }
}
You should be able to pop this into a simple "Hello World" module; after insmodding it, you will see the sockets in dmesg (much like ss or netstat).
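If it helps, a bare-bones module skeleton to drop that function into could look like the following (a sketch only; the module name and the call from the init function are my own choices, and the exact headers needed depend on the kernel version):
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

void print_tcp_socks(void);   /* the function shown above, compiled into the same module */

static int __init sock_walk_init(void)
{
    printk(KERN_INFO "sock_walk: dumping established TCP sockets\n");
    print_tcp_socks();
    return 0;
}

static void __exit sock_walk_exit(void)
{
    printk(KERN_INFO "sock_walk: unloaded\n");
}

module_init(sock_walk_init);
module_exit(sock_walk_exit);
MODULE_LICENSE("GPL");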
I understand that what you want is to automate the process for a test. But if you just want to check the correct handling of the recvfrom error, you could attach with GDB and close the fd with a close() call.
Here you could see an example.
Another option is to use scapy for crafting proper RST packets (which is not in your list). This is how I tested connection RSTs in a bridged system (IMHO it is the best option), and you could also implement a graceful shutdown.
Here is an example of the scapy script.

'close' function on serial port descriptor blocks in linux

Recently I've found a problem which is quite new to me and I'd appreciate advice. I'm doing serial communication on Linux using termios functions. I actually don't use a real serial port, but the virtual gadget serial driver /dev/ttyGS0. The file descriptor is opened as non-blocking.
My program periodically generates data and sends it to /dev/ttyGS0. There is no indication whether the other end reads it or not. If it does not, some internal FIFO fills up and write returns a "would block" error. So far so good; I have no problems with that.
The problem is, when I want to close such a file descriptor with a filled FIFO, the close function blocks! Not indefinitely, but for about 10 seconds.
I tried doing tcflush(uart->fd, TCOFLUSH) before closing, without any effect.
This behavior is very strange to me, and I found no documentation saying that close could block. Is there any way to avoid this? Or at least to decrease this timeout? Where should I look for this timeout? The VTIME attribute also has no effect on this.
As Amardeep mentioned, the close() call is handled by the driver. Close itself is always a blocking call, but generally it's a fast one.
So, the answer is that the delay is specific to the virtual gadget driver. I don't have experience with that one to help.
How important is it to close the file? If the delay is a major problem and the file needs to be closed (such as avoiding file descriptor leaks in a long-running process), then the close will probably need to be called in a separate thread. Obviously, the best answer would be one specific to that driver; perhaps research there might yield an answer, such as an ioctl() call that clears the state of the virtual device.
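As a sketch of the "separate thread" idea (pthread-based; close_in_background is just a name I made up, and the program would be linked with -lpthread):
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>

/* Thread body: just close the descriptor; the driver's long close delay
 * is absorbed here instead of blocking the caller. */
static void *closer_thread(void *arg)
{
    close((int)(intptr_t)arg);
    return NULL;
}

/* Hand the descriptor off to a detached thread and return immediately. */
static int close_in_background(int fd)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, closer_thread, (void *)(intptr_t)fd);
    if (rc != 0)
        return rc;   /* fall back to a plain close() if thread creation fails */
    return pthread_detach(tid);
}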
You may need to configure your port's closing_wait parameter. From the setserial manual:
closing_wait delay
Specify the amount of time, in hundredths of a second, that the kernel should wait for data to be transmitted from the serial port while closing the port. If "none" is specified, no delay will occur. If "infinite" is specified, the kernel will wait indefinitely for the buffered data to be transmitted. The default setting is 3000, or 30 seconds of delay. This default is generally appropriate for most devices. If too long a delay is selected, then the serial port may hang for a long time when a serial port which is not connected, and has data pending, is closed. If too short a delay is selected, then there is a risk that some of the transmitted data is not output at all. If the device is extremely slow, like a plotter, the closing_wait may need to be larger.
Check with setserial the parameters for your port:
$ setserial -g -a /dev/ttyS0
/dev/ttyS0, Line 0, UART: 16550A, Port: 0x03f8, IRQ: 4
Baud_base: 115200, close_delay: 50, divisor: 0
closing_wait: 3000
Flags: spd_normal skip_test
In my case, a faulting device was not receiving the last bytes I sent it, and closing the port always took 30 seconds because of this. You can change this timeout with setserial, for example, to 1 second:
$ sudo setserial /dev/ttyS0 closing_wait 100
Of course, you may want to issue this command on startup in your /etc/rc.local or whatever script your distro uses to configure your ports.
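If you prefer to change it from inside the program rather than with setserial, the same field can usually be adjusted through the TIOCGSERIAL/TIOCSSERIAL ioctls. This is a sketch; whether the ttyGS0 gadget driver honors closing_wait at all is an assumption you would need to verify.
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/serial.h>   /* struct serial_struct */

/* Set closing_wait (in hundredths of a second) on an already-open port fd. */
static int set_closing_wait(int fd, unsigned short hundredths)
{
    struct serial_struct ss;

    if (ioctl(fd, TIOCGSERIAL, &ss) < 0) {
        perror("TIOCGSERIAL");
        return -1;
    }
    ss.closing_wait = hundredths;   /* e.g. 100 = 1 second */
    if (ioctl(fd, TIOCSSERIAL, &ss) < 0) {
        perror("TIOCSSERIAL");
        return -1;
    }
    return 0;
}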
I faced the same issue; in my case, disabling flow control before closing the device helped. You can do this using the following function:
#include <stdio.h>
#include <string.h>
#include <termios.h>

int set_flowcontrol(int fd, int control)
{
    struct termios tty;
    memset(&tty, 0, sizeof tty);
    if (tcgetattr(fd, &tty) != 0)
    {
        perror("error from tcgetattr");
        return -1;
    }
    if (control)
        tty.c_cflag |= CRTSCTS;
    else
        tty.c_cflag &= ~CRTSCTS;
    if (tcsetattr(fd, TCSANOW, &tty) != 0)
    {
        perror("error setting term attributes");
        return -1;
    }
    return 0;
}
Just call this before closing:
...
rc = set_flowcontrol(fd, 0);
if (rc != 0)
{
    perror("error setting flowcontrol: ");
    exit(-1);
}

rc = close(fd);
if (rc != 0)
{
    perror("error closing fd: ");
    exit(-1);
}
...

sendto on Tru64 is returning ENOBUF

I am currently running an old system on Tru64 which involves lots of UDP sockets using the sendto() function. The sockets are used in our code to send messages to/from various processes and then eventually on to a thick client app that is connected remotely. Occasionally the socket to the thick client gets stuck; this can cause some of these messages to build up. My question is: how can I determine the current buffer size, and how do I determine the maximum message buffer? The code below gives a snippet of how I set up the port and use the sendto function.
/* need to adjust the maximum size we can send on this */
/* as it needs to be able to cope with the biggest     */
/* messages we send                                    */
int lenlen, len;

lenlen = sizeof(len);
/* allow double for when the system is under load */
len = 2 * 32000;

msg_socket = socket( AF_UNIX, SOCK_DGRAM, 0);
result = setsockopt( msg_socket, SOL_SOCKET, SO_SNDBUF, (char *)&len, lenlen);

result = sendto( msg_socket,
                 (char *)message,
                 (int)message_len,
                 flags,
                 dest_addr,
                 addrlen);
Note. We have ported this application to Linux and the problem does not seem to appear there.
Any help would be greatly appreciated.
Regards
UDP send buffer size is different from TCP - it just limits the size of the datagram. Quoting Stevens UNP Vol. 1:
...
A UDP socket has a send buffer size (which we can change with SO_SNDBUF socket option, Section 7.5), but this is simply an upper limit on the maximum-sized UDP datagram that can be written to the socket. If an application writes a datagram larger than the socket send buffer size, EMSGSIZE is returned. Since UDP is unreliable, it does not need to keep a copy of the application's data and does not need an actual send buffer. (The application data is normally copied into a kernel buffer of some form as it passes down the protocol stack, but this copy is discarded by the datalink layer after the data is transmitted.)
UDP simply prepends an 8-byte header and passes the datagram to IP. IPv4 or IPv6 prepends its header, determines the outgoing interface by performing the routing function, and then either adds the datagram to the datalink output queue (if it fits within the MTU) or fragments the datagram and adds each fragment to the datalink output queue. If a UDP application sends large datagrams (say 2,000-byte datagrams), there's a much higher probability of fragmentation than with TCP, because TCP breaks the application data into MSS-sized chunks, something that has no counterpart in UDP.
The successful return from write to a UDP socket tells us that either the datagram or all fragments of the datagram have been added to the datalink output queue. If there is no room on the queue for the datagram or one of its fragments, ENOBUFS is often returned to the application.
Unfortunately, some implementations do not return this error, giving the application no indication that the datagram was discarded without even being transmitted.
The last footnote needs attention - but it looks like Tru64 has this error code listed in the manual page.
The proper way of doing it, though, is to queue your outstanding messages in the application itself and to carefully check return values and errno after each system call. This still does not guarantee delivery (since UDP receivers might drop packets without any notice to the senders). Check the UDP packet discard counters with netstat -s on both/all sides and see if they are growing. There is really no way around this besides switching to TCP or implementing your own timeout/ack and retransmission logic.
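As an illustration of "check return values and errno after each system call", here is a hedged sketch of a send wrapper that backs off briefly on ENOBUFS (the retry count and delay are arbitrary; a real application would queue the message and retry later from its main loop):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

/* Try to send a datagram, backing off briefly if the output queue is full.
 * Returns 0 on success, -1 if the message still could not be sent. */
static int send_with_retry(int fd, const void *msg, size_t len,
                           const struct sockaddr *dest, socklen_t destlen)
{
    int attempt;

    for (attempt = 0; attempt < 5; attempt++) {
        if (sendto(fd, msg, len, 0, dest, destlen) >= 0)
            return 0;

        if (errno == ENOBUFS || errno == EAGAIN) {
            usleep(10000);   /* datalink output queue full: wait and retry */
            continue;
        }

        fprintf(stderr, "sendto failed: %s\n", strerror(errno));
        return -1;
    }
    return -1;
}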
You should probably be using some sort of congestion control to avoid overloading the network. By far the easiest way to do this is to use TCP instead of UDP.
It fails less often on Linux because UDP sockets there wait for space in the local network interface queue (unless you set them non-blocking). However, with any operating system, if the overfull queue is not in the local system, the packet will be dropped silently.
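For reference, switching a socket to non-blocking mode, so that a full queue shows up immediately as EAGAIN/EWOULDBLOCK instead of a blocking wait, is just a fcntl flag change:
#include <fcntl.h>

/* Put an existing socket into non-blocking mode; returns 0 on success. */
static int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}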

How to UDP Broadcast from Linux Kernel?

I'm developing an experimental Linux kernel module, so...
How to UDP Broadcast from Linux Kernel?
-13 is -EACCES. Do you have SO_BROADCAST set? I believe sock_sendmsg returns -EACCES if SO_BROADCAST isn't set and you're sending to a broadcast address.
You're looking for <errno.h> for error codes.
What kernel version are you developing under? I'd like to browse thru the kernel source briefly. I'm not seeing how -ENOPKG can be returned from sock_set, but I do see that -ENOPROTOOPT can be returned (which is errno 92 in kernel 2.6.27).
Oh-- and repost that bit of code where you're setting SO_BROADCAST, if you would. I didn't make a note of it and I'd like to look at it again.
Try calling it with SOL_UDP. I think that's what you're looking for. I don't have a 2.6.18 build environment setup anywhere to play w/ this, but give that a shot.
No-- nevermind-- that's not going to do what you want. I should've read a little further in the source. I'll keep looking. This is kinda fun.
I suppose you could just set the broadcast flag yourself! :)
lock_sock(sock->sk);
sock_set_flag(sock->sk, SOCK_BROADCAST);
release_sock(sock->sk);
You've got me stumped, and I've got to head off to bed. I did find this bit of code that might be of some assistance, though these guys aren't doing broadcasts.
http://kernelnewbies.org/Simple_UDP_Server
Good luck-- I wish I could have solved it for you.
#adjuster..
Actually, I just got it. When I'm setting SO_BROADCAST, I'm receiving 92 (Package not installed).
What package should I install, then?
Edit: The Kernel version is 2.6.18, and you are right! 92 is ENOPROTOOPT
//Socket creation
sock_create(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);

//Broadcasting
int broadcast = 1;
int err;
if( (err = sock->ops->setsockopt(sock, SOL_SOCKET, SO_BROADCAST, (char *)&broadcast, sizeof broadcast)) < 0 )
{
    printk(KERN_ALERT MODULE_NAME ": Could not configure broadcast, error %d\n", err);
    return -1;
}
Edit: I've got this from the setsockopt man page...
ENOPROTOOPT
The option is unknown at the level indicated.
...so, I suppose that SOL_SOCKET isn't the right value to pass. I've also tried IPPROTO_UDP instead of SOL_SOCKET with no luck.
Edit: http://docs.hp.com/en/32650-90372/ch02s10.html says that SO_BROADCAST is an option of the SOL_SOCKET level, but I continue to get -92
Edit: I'm desperate, so I've tried SOL_UDP, still -92.
Yes, it is fun :) ... Good synergy! At the end (I hope we get there soon) let's assemble a definitive answer, clean and nice! :)
Edit: Even if I hard-set the broadcast flag, sock_sendmsg will still fail (-13, "Permission denied"):
sock->sk->sk_flags |= SO_BROADCAST;
I really need some help on this one..
Mm, I wish I had more time to help you out.
To get UDP multicasting to work, it has to be baked into your kernel. You have to enable it when you configure your kernel. Google should have more info; I hope this puts you on the right track.
Look at the IPVS (linux virtual server) code in the Linux kernel. It already has a working implementation of UDP multicast, which it uses to share connection state for failover.
Having already taken a look at this and knowing some people who have done this, I would really recommend creating a netfilter link and using a userspace daemon to broadcast the information over the network.
The following worked for me (so finally this thread could be closed).
int yes = 1;
sock_setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &yes, sizeof(yes));
sock->ops->connect(sock, (struct sockaddr *)&addr, sizeof(struct sockaddr), 0);
Here sock is an initialized struct socket, and addr should be a struct sockaddr_in with a broadcast address in it.
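To round this out, here is a hedged sketch of how that addr might be filled in and a datagram sent from module code. The port number is arbitrary, and the exact in-kernel signatures (sock_setsockopt, kernel_sendmsg) vary between kernel versions, so treat this as an outline rather than a drop-in implementation:
#include <linux/net.h>
#include <linux/in.h>
#include <linux/socket.h>
#include <net/sock.h>

static int send_udp_broadcast(struct socket *sock, const char *buf, size_t len)
{
    struct sockaddr_in addr;
    struct msghdr msg;
    struct kvec vec;
    int yes = 1;

    /* Allow broadcasts on this socket, as in the snippet above. */
    sock_setsockopt(sock, SOL_SOCKET, SO_BROADCAST, (char *)&yes, sizeof(yes));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);                      /* arbitrary example port */
    addr.sin_addr.s_addr = htonl(INADDR_BROADCAST);   /* 255.255.255.255 */

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &addr;        /* destination given per-datagram instead of connect() */
    msg.msg_namelen = sizeof(addr);

    vec.iov_base = (void *)buf;
    vec.iov_len = len;

    return kernel_sendmsg(sock, &msg, &vec, 1, len);
}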
