force socket disconnect without forging RST, Linux

I have a network client which is stuck in recvfrom on a connection to a server not under my control which, after 24+ hours, is probably never going to respond. The program has processed a great deal of data, so I don't want to kill it; I want it to abandon the current connection and proceed. (It will do so correctly if recvfrom returns EOF or -1.) I have already tried several different programs that purport to be able to disconnect stale TCP channels by forging RSTs (tcpkill, cutter, killcx); none had any effect, and the program remained stuck in recvfrom. I have also tried taking the network interface down; again, no effect.
It seems to me that there really should be a way to force a disconnect at the socket-API level without forging network packets. I do not mind horrible hacks, up to and including poking kernel data structures by hand; this is a disaster-recovery situation. Any suggestions?
(For clarity, the TCP channel at issue here is in ESTABLISHED state according to lsof.)

I do not mind horrible hacks
That's all you have to say. I am guessing the tools you tried didn't work because they sniff traffic to get an acceptable ACK number to kill the connection. Without traffic flowing they have no way to get hold of it.
Here are things you can try:
Probe all the sequence numbers
Where those tools failed you can still do it. Write a simple Python script with scapy that, for each candidate sequence number, sends a RST segment with the correct 4-tuple (addresses and ports). There are at most 4 billion sequence numbers to try (in practice far fewer: any sequence number that falls within the receive window is accepted, so you can step by the window size - and you can find the window for free using ss -i).
Make a kernel module to get hold of the socket
Make a kernel module that walks the list of TCP sockets: look for sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[i].chain)
Identify your victim sk
At this point you have intimate access to your socket. So:
You can call tcp_reset or tcp_disconnect on it. You won't be able to call tcp_reset directly (it has no EXPORT_SYMBOL), but you should be able to mimic it: most of the functions it calls are exported.
Or you can get the expected ACK number from tcp_sk(sk) and directly forge a RST packet with scapy.
Here is a function I use to print established sockets - I scrounged bits and pieces from the kernel to make it some time ago:
#include <net/inet_hashtables.h>
#define NIPQUAD(addr) \
((unsigned char *)&addr)[0], \
((unsigned char *)&addr)[1], \
((unsigned char *)&addr)[2], \
((unsigned char *)&addr)[3]
#define NIPQUAD_FMT "%u.%u.%u.%u"
extern struct inet_hashinfo tcp_hashinfo;
/* Decides whether a bucket has any sockets in it. */
static inline bool empty_bucket(int i)
{
return hlist_nulls_empty(&tcp_hashinfo.ehash[i].chain);
}
void print_tcp_socks(void)
{
int i = 0;
struct inet_sock *inet;
/* Walk hash array and lock each if not empty. */
printk("Established ---\n");
for (i = 0; i <= tcp_hashinfo.ehash_mask; i++) {
struct sock *sk;
struct hlist_nulls_node *node;
spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, i);
/* Lockless fast path for the common case of empty buckets */
if (empty_bucket(i))
continue;
spin_lock_bh(lock);
sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[i].chain) {
if (sk->sk_family != PF_INET)
continue;
inet = inet_sk(sk);
printk(NIPQUAD_FMT":%hu ---> " NIPQUAD_FMT
":%hu\n", NIPQUAD(inet->inet_saddr),
ntohs(inet->inet_sport), NIPQUAD(inet->inet_daddr),
ntohs(inet->inet_dport));
}
spin_unlock_bh(lock);
}
}
You should be able to pop this into a simple "Hello World" module; after insmod-ing it, dmesg will list the sockets (much like ss or netstat).
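If you go the tcp_disconnect route, the teardown itself might look roughly like the sketch below. This is untested and written against ~3.x-era field names, so adjust for your kernel: once the loop above has matched the stuck connection's 4-tuple, take a reference with sock_hold(), drop the ehash bucket spinlock (lock_sock() may sleep), then reset and disconnect the socket.
#include <net/tcp.h>

/* Rough sketch only: tear down a victim sk found by the loop above.
 * Call this after sock_hold(sk) and spin_unlock_bh(lock). */
static void kill_stuck_sock(struct sock *sk)
{
	lock_sock(sk);
	/* Mimic what tcp_reset() would do: make the blocked recvfrom()
	 * fail instead of sleeping forever. */
	sk->sk_err = ECONNRESET;
	sk->sk_error_report(sk);	/* wake up the sleeping reader */
	tcp_disconnect(sk, 0);		/* exported, unlike tcp_reset() */
	release_sock(sk);
	sock_put(sk);			/* drop the reference taken by sock_hold() */
}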

I understand that what you want is to automate the process for a test. But if you just want to check the correct handling of the recvfrom error, you could attach with GDB and close the fd with a close() call.
Here you can see an example.
Another option is to use scapy to craft proper RST packets (which is not in your list). This is the way I tested connection RSTs in a bridged system (IMHO the best option); you could also implement a graceful shutdown.
Here is an example of the scapy script.

Related

How to flush raw AF_PACKET socket to get correct filtered packets

sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &f, sizeof(f));
With this simple BPF/LPF attach code, when I try to receive packets on the socket, I get some packets that don't match the filter. It seems those packets got into the socket before I called setsockopt().
It seems like I should first create the AF_PACKET SOCK_RAW socket, then attach the filter, then flush the socket to get rid of those stray packets.
So the question is: how do I flush those packets?
The "bug" you're describing is real and I've seen it at multiple companies in my career. There is something like an "oral tradition" around this bug that is passed from one network engineer to another. Here are the common fixes:
Just call recv on the socket until it is empty
Double-filter by filtering packets in usermode as well as using the bpf
Use the zero-BPF technique just like libpcap, where you apply an empty BPF first, then empty the socket, and then apply the real BPF (sketched below).
I've written about this problem extensively on my blog to try and codify the oral tradition around this bug into a concrete recommendation and best-practice.
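For illustration, here is a rough C sketch of option 3 (the zero-BPF trick). The function name and the minimal error handling are mine; the caller supplies the real filter program:
#include <arpa/inet.h>
#include <linux/filter.h>
#include <linux/if_ether.h>
#include <sys/socket.h>
#include <unistd.h>

int open_filtered_packet_socket(struct sock_fprog *real_filter)
{
    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0)
        return -1;

    /* 1. Attach a filter that rejects everything ("return 0"). */
    struct sock_filter zero_insn = BPF_STMT(BPF_RET | BPF_K, 0);
    struct sock_fprog zero_prog = { .len = 1, .filter = &zero_insn };
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER,
                   &zero_prog, sizeof(zero_prog)) < 0)
        goto fail;

    /* 2. Drain whatever was queued before the filter took effect. */
    char drain[1];
    while (recv(sock, drain, sizeof(drain), MSG_DONTWAIT) >= 0)
        ;                     /* stops when recv fails with EAGAIN */

    /* 3. Attach the real filter; only matching packets queue from now on. */
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER,
                   real_filter, sizeof(*real_filter)) < 0)
        goto fail;

    return sock;
fail:
    close(sock);
    return -1;
}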

Linux UART slower than specified Baudrate

I'm trying to communicate between two Linux systems via UART.
I want to send large chunks of data. With the specified Baudrate it should take around 5 seconds, but it takes nearly 10 times the expected time.
As I'm sending more than the buffer can handle at once, the data is sent in small parts and I'm draining the buffer in between. If I measure the time needed for the drain and the number of bytes written to the buffer, I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
I would expect the transmission to be somewhat slower than optimal, but not by this much.
Did I miss something while setting the UART or while writing? Or is this normal?
The code used for setup:
int bus = open(interface.c_str(), O_RDWR | O_NOCTTY | O_NDELAY); // <- also tried blocking
if (bus < 0) {
return;
}
struct termios options;
memset (&options, 0, sizeof options);
if(tcgetattr(bus, &options) != 0){
close(bus);
bus = -1;
return;
}
cfsetspeed (&options, B230400);
cfmakeraw(&options); // <- also tried this manually. did not make a difference
if(tcsetattr(bus, TCSANOW, &options) != 0)
{
close(bus);
bus = -1;
return;
}
tcflush(bus, TCIFLUSH);
The code used to send:
int32_t res = write(bus, data, dataLength);
while (res < dataLength){
tcdrain(bus); // <- taking way longer than expected
int32_t r = write(bus, &data[res], dataLength - res);
if(r == 0)
break;
if(r == -1){
break;
}
res += r;
}
B230400
The docs are contradictory. cfsetspeed is documented as requiring a speed_t type, while the note says you need to use one of the "B" constants like "B230400." Have you tried using an actual speed_t type?
In any case, the speed you're supplying is the baud rate, which in this case should get you approximately 23,000 bytes/second, assuming there is no throttling.
The speed is dependent on hardware and link limitations. Also the serial protocol allows pausing the transmission.
FWIW, according to the time and speed you listed, if everything works perfectly, you'll get about 1 MB in 50 seconds. What speed are you actually getting?
Another "also" is the options structure. It's been years since I've had to do any serial I/O, but IIRC, you need to actually set the options that you want and are supported by your hardware, like CTS/RTS, XON/XOFF, etc.
This might be helpful.
As I'm sending more than the buffer can handle at once it is send in small parts and I'm draining the buffer in between.
You have only provided code snippets (rather than a minimal, complete, and verifiable example), so your data size is unknown.
But the Linux kernel buffer size is known. What do you think it is?
(FYI it's 4KB.)
If I measure the time needed for the drain and the number of bytes written to the buffer I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
You're confusing throughput with baudrate.
The maximum throughput (of just payload) of an asynchronous serial link will always be less than the baudrate due to framing overhead per character, which could be two of the ten bits of the frame (assuming 8N1): at 230400 baud that caps payload at 230400 / 10 = 23040 bytes/second. Since your termios configuration is incomplete, the overhead could actually be three of the eleven bits of the frame (assuming 8N2), i.e. at most 230400 / 11 ≈ 20945 bytes/second.
In order to achieve the maximum throughput, the transmitting UART must saturate the line with frames and never let the line go idle.
The userspace program must be able to supply data fast enough, preferably by one large write() to reduce syscall overhead.
Did I miss something while setting the UART or while writing?
With Linux, you have limited access to the UART hardware.
From userspace your program accesses a serial terminal.
Your program accesses the serial terminal in a sub-optimal manner.
Your termios configuration appears to be incomplete (a fuller setup is sketched after this list):
It leaves both hardware and software flow-control untouched.
The number of stop bits is untouched.
The Ignore modem control lines and Enable receiver flags are not enabled.
For raw reading, the VMIN and VTIME values are not assigned.
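For illustration, a fuller blocking-mode, raw 8N1 setup might look like this sketch (untested; the open_uart name is mine, and which flags you actually need depends on your hardware):
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

int open_uart(const char *dev)
{
    int fd = open(dev, O_RDWR | O_NOCTTY);   /* blocking: no O_NDELAY */
    if (fd < 0)
        return -1;

    struct termios options;
    if (tcgetattr(fd, &options) != 0) {
        close(fd);
        return -1;
    }

    cfmakeraw(&options);                 /* raw (non-canonical) mode */
    cfsetspeed(&options, B230400);

    options.c_cflag |= CLOCAL | CREAD;   /* ignore modem lines, enable receiver */
    options.c_cflag &= ~CSTOPB;          /* one stop bit -> 8N1 */
    options.c_cflag &= ~CRTSCTS;         /* no hardware flow control */
    options.c_iflag &= ~(IXON | IXOFF | IXANY);  /* no software flow control */

    options.c_cc[VMIN]  = 1;             /* block until at least one byte */
    options.c_cc[VTIME] = 0;

    if (tcsetattr(fd, TCSANOW, &options) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}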
Or is this normal?
There are ways to easily speed up the transfer.
First, your program combines non-blocking mode with non-canonical mode. That's a degenerate combination for receiving, and suboptimal for transmitting.
You have provided no reason for using non-blocking mode, and your program is not written to properly utilize it.
Therefore your program should be revised to use blocking mode instead of non-blocking mode.
Second, the tcdrain() between write() syscalls can introduce idle time on the serial link. Use of blocking mode eliminates the need for this delay tactic between write() syscalls.
In fact with blocking mode only one write() syscall should be needed to transmit the entire dataLength. This would also minimize any idle time introduced on the serial link.
Note that the first write() does not check its return value for an error condition, which is always possible.
Bottom line: your program would be simpler and throughput would be improved by using blocking I/O.
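As a sketch of that simplification (reusing the bus, data and dataLength names from the question, and assuming the descriptor was opened without O_NDELAY so writes block):
/* One blocking write loop replaces the write/tcdrain/write dance. */
ssize_t sent = 0;
while (sent < (ssize_t)dataLength) {
    ssize_t r = write(bus, &data[sent], dataLength - sent);
    if (r < 0) {
        perror("write");          /* always check for errors */
        break;
    }
    sent += r;
}
tcdrain(bus);                     /* wait once, after everything is queued */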

Change congestion control algorithms per connection

The sysctl command in Linux currently changes the congestion control algorithm globally for the entire system. But congestion control, where the TCP window size and other similar parameters are varied, is normally done per TCP connection. So my question is:
Does there exist a way where I can change the congestion control algorithm being used per TCP connection?
Or am I missing something trivial here? If so, what is it?
This is done in iperf using the -Z option - the patch is here.
This is how it is implemented (PerfSocket.cpp, line 93):
if ( isCongestionControl( inSettings ) ) {
#ifdef TCP_CONGESTION
Socklen_t len = strlen( inSettings->mCongestion ) + 1;
int rc = setsockopt( inSettings->mSock, IPPROTO_TCP, TCP_CONGESTION,
inSettings->mCongestion, len);
if (rc == SOCKET_ERROR ) {
fprintf(stderr, "Attempt to set '%s' congestion control failed: %s\n",
inSettings->mCongestion, strerror(errno));
exit(1);
}
#else
fprintf( stderr, "The -Z option is not available on this operating system\n");
#endif
}
Where mCongestion is a string containing the name of the algorithm to use.
It seems this is possible via get/setsockopt. The only documentation I found is:
http://lkml.indiana.edu/hypermail/linux/net/0811.2/00020.html
In newer versions of Linux it is possible to set the congestion control for a specific destination using ip route ... congctl.
If anyone is familiar with this approach, please edit this post.
As far as I know, there's no way to change the default TCP congestion control per process (I'd love a bash script to be able to say that whatever it executes should default to the lp congestion control).
The only user mode API I'm aware of is as follows:
setsockopt(socket, SOL_TCP, TCP_CONGESTION, congestion_alg, strlen(congestion_alg));
where socket is an open socket, and congestion_alg is a string containing one of the words in /proc/sys/net/ipv4/tcp_allowed_congestion_control.
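A minimal, self-contained sketch of that call (error handling trimmed; "reno" is just an example name that should appear in tcp_allowed_congestion_control):
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    const char *alg = "reno";   /* any algorithm listed in
                                   /proc/sys/net/ipv4/tcp_allowed_congestion_control */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) {
        perror("socket");
        return 1;
    }
    if (setsockopt(s, IPPROTO_TCP, TCP_CONGESTION, alg, strlen(alg)) < 0)
        perror("setsockopt(TCP_CONGESTION)");
    /* ...connect() and use the socket as usual; only this connection
     * uses the chosen algorithm... */
    return 0;
}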
Linux has pluggable congestion algorithms, which let you change the algorithm used on the fly, but this is a system-wide setting, not per connection.

sendto on Tru64 is returning ENOBUF

I am currently running an old system on Tru64 which involves lots of UDP sockets using the sendto() function. The sockets are used in our code to send messages to/from various processes and eventually on to a thick client app that is connected remotely. Occasionally the socket to the thick client gets stuck, which can cause some of these messages to build up. My question is: how can I determine the current buffer size, and how do I determine the maximum message buffer? The code below gives a snippet of how I set up the port and use the sendto function.
/* need to adjust the maximum size we can send on this */
/* as it needs to be able to cope with the biggest */
/* messages we send */
/* allow double for when the system is under load */
int lenlen, len ;
lenlen = sizeof(len) ;
len = 2 * 32000;
msg_socket = socket( AF_UNIX,SOCK_DGRAM, 0);
result = setsockopt(msg_socket, SOL_SOCKET, SO_SNDBUF, (char *)&len, lenlen) ;
result = sendto( msg_socket,
(char *)message,
(int)message_len,
flags,
dest_addr,
addrlen);
Note: We have ported this application to Linux and the problem does not seem to appear there.
Any help would be greatly appreciated.
Regards
UDP send buffer size is different from TCP - it just limits the size of the datagram. Quoting Stevens UNP Vol. 1:
...
A UDP socket has a send buffer size (which we can change with SO_SNDBUF socket option, Section 7.5), but this is simply an upper limit on the maximum-sized UDP datagram that can be written to the socket. If an application writes a datagram larger than the socket send buffer size, EMSGSIZE is returned. Since UDP is unreliable, it does not need to keep a copy of the application's data and does not need an actual send buffer. (The application data is normally copied into a kernel buffer of some form as it passes down the protocol stack, but this copy is discarded by the datalink layer after the data is transmitted.)
UDP simply prepends an 8-byte header and passes the datagram to IP. IPv4 or IPv6 prepends its header, determines the outgoing interface by performing the routing function, and then either adds the datagram to the datalink output queue (if it fits within the MTU) or fragments the datagram and adds each fragment to the datalink output queue. If a UDP application sends large datagrams (say 2,000-byte datagrams), there's a much higher probability of fragmentation than with TCP, because TCP breaks the application data into MSS-sized chunks, something that has no counterpart in UDP.
The successful return from write to a UDP socket tells us that either the datagram or all fragments of the datagram have been added to the datalink output queue. If there is no room on the queue for the datagram or one of its fragments, ENOBUFS is often returned to the application.
Unfortunately, some implementations do not return this error, giving the application no indication that the datagram was discarded without even being transmitted.
The last footnote needs attention - but it looks like Tru64 has this error code listed in the manual page.
The proper way of doing it though is to queue your outstanding messages in the application itself and to carefully check return values and the errno after each system call. This still does not guarantee delivery (since UDP receivers might drop the packets without any notice to the senders). Check the UDP packet discard counters with netstat -s on both/all sides, see if they are growing. There is really no way around this besides switching to TCP or implementing your own timeout/ack and re-transmission logic.
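As an illustrative POSIX-style sketch of that advice (the function name, retry count and sleep interval are arbitrary choices, not taken from the original code):
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Retry a UDP sendto() a few times when the output queue is full (ENOBUFS),
 * instead of silently losing the datagram. Any other error is passed up. */
static int send_with_backoff(int sock, const void *msg, size_t len,
                             const struct sockaddr *dest, socklen_t destlen)
{
    int attempts;
    for (attempts = 0; attempts < 5; attempts++) {
        if (sendto(sock, msg, len, 0, dest, destlen) >= 0)
            return 0;             /* queued locally; still not guaranteed delivered */
        if (errno != ENOBUFS)
            return -1;            /* real error: let the caller inspect errno */
        usleep(10000);            /* give the datalink output queue time to drain */
    }
    return -1;                    /* still full: caller must queue or drop */
}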
You should probably be using some sort of congestion control to avoid overloading the network. By far the easiest way to do this is to use TCP instead of UDP.
It fails less often on Linux because there UDP sockets wait for space in the local network interface queue (unless you set them non-blocking). However, on any operating system, if the overfull queue is not in the local system, the packet will be dropped silently.

How to UDP Broadcast from Linux Kernel?

I'm developing an experimental Linux Kernel module, so...
How to UDP Broadcast from Linux Kernel?
-13 is -EACCES. Do you have SO_BROADCAST set? I believe sock_sendmsg returns -EACCES if SO_BROADCAST isn't set and you're sending to a broadcast address.
You're looking for <errno.h> for error codes.
What kernel version are you developing under? I'd like to browse through the kernel source briefly. I'm not seeing how -ENOPKG can be returned from sock_set, but I do see that -ENOPROTOOPT can be returned (which is errno 92 in kernel 2.6.27).
Oh-- and repost that bit of code where you're setting SO_BROADCAST, if you would. I didn't make a note of it and I'd like to look at it again.
Try calling it with SOL_UDP. I think that's what you're looking for. I don't have a 2.6.18 build environment set up anywhere to play with this, but give that a shot.
No-- nevermind-- that's not going to do what you want. I should've read a little further in the source. I'll keep looking. This is kinda fun.
I suppose you could just set the broadcast flag yourself! :)
lock_sock(sock->sk);
sock_set_flag(sock->sk, SOCK_BROADCAST);
release_sock(sock->sk);
You've got me stumped, and I've got to head off to bed. I did find this bit of code that might be of some assistance, though these guys aren't doing broadcasts.
http://kernelnewbies.org/Simple_UDP_Server
Good luck-- I wish I could have solved it for you.
#adjuster..
Actually, I just got it. When I set SO_BROADCAST, I receive 92 (Package not installed).
What package should I install, then?
Edit: The Kernel version is 2.6.18, and you are right! 92 is ENOPROTOOPT
//Socket creation
sock_create(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
//Broadcasting
int broadcast = 1;
int err;
if( (err = sock->ops->setsockopt(sock, SOL_SOCKET, SO_BROADCAST, (char *)&broadcast, sizeof broadcast)) < 0 )
{
printk(KERN_ALERT MODULE_NAME ": Could not configure broadcast, error %d\n", err);
return -1;
}
Edit: I've got this from the setsockopt man page...
ENOPROTOOPT
The option is unknown at the level indicated.
...so, I suppose that SOL_SOCKET isn't the right value to pass. I've also tried IPPROTO_UDP instead of SOL_SOCKET with no luck.
Edit: http://docs.hp.com/en/32650-90372/ch02s10.html says that SO_BROADCAST is an option of the SOL_SOCKET level, but I continue to get -92
Edit: I'm desperate, so I've tried SOL_UDP, still -92.
Yes, it is fun :) ... Good synergy! At the end (I hope we get there soon) let's assemble a definitive answer, clean and nice! :)
Edit: Even if I hard-set the broadcast flag, sock_sendmsg still fails (-13, "Permission denied")
sock->sk->sk_flags |= SO_BROADCAST;
I really need some help on this one..
Mm, I wish I had more time to help you out.
To get UDP multicasting to work, it has to be baked into your kernel. You have to enable it when you configure your kernel. Google should have more info; I hope this puts you on the right track.
Look at the IPVS (linux virtual server) code in the Linux kernel. It already has a working implementation of UDP multicast, which it uses to share connection state for failover.
Having already taken a look at this and knowing some people who have done this, I would really recommend creating a netfilter link and using a userspace daemon to broadcast the information over the network.
The following worked for me (so finally this thread could be closed).
int yes = 1;
sock_setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &yes, sizeof(yes));
sock->ops->connect(sock, (struct sockaddr *)&addr, sizeof(struct sockaddr), 0);
Here sock is an initialized struct socket and addr should be a struct sockaddr_in with a broadcast address in it.
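Pulling the thread together, a hedged end-to-end sketch for a 2.6-era kernel might look like the following. It is untested; function availability and signatures (sock_create_kern in particular) differ between kernel versions, and the port and payload handling are placeholders:
#include <linux/in.h>
#include <linux/net.h>
#include <linux/string.h>
#include <linux/types.h>
#include <net/sock.h>

static int send_udp_broadcast(const void *payload, size_t len, u16 port)
{
    struct socket *sock;
    struct sockaddr_in addr;
    struct msghdr msg;
    struct kvec vec;
    int ret;

    ret = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
    if (ret < 0)
        return ret;

    /* Equivalent of setsockopt(SO_BROADCAST) from kernel context. */
    sock_set_flag(sock->sk, SOCK_BROADCAST);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_BROADCAST);   /* 255.255.255.255 */

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &addr;
    msg.msg_namelen = sizeof(addr);

    vec.iov_base = (void *)payload;
    vec.iov_len = len;

    ret = kernel_sendmsg(sock, &msg, &vec, 1, len);
    sock_release(sock);
    return ret;
}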
