Is it possible to modify a single member of a kernel struct in TCP? I want to be able to use setsockopt() to update a member of the tcp_info struct in TCP.
I've tried the following:
struct tcp_info info;
unsigned int optlen = sizeof(struct tcp_info);
if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &info, &optlen) < 0)
printf("Can't get data from getsockopt.\n");
info.retransmits += 10; // random member of tcp_info - as example
if (setsockopt(sock, IPPROTO_TCP, TCP_INFO, (char *) &info, optlen) < 0)
printf("Can't set data with setsockopt.\n");
The call to setsockopt() fails (returns a negative value).
Even if the approach above had worked, it doesn't seem optimal. Is it possible to modify the value of a single member of a struct without having to fetch and update the entire struct (all of its members)?
You may not set arbitrary values with setsockopt(). It has a finite list of options you may set.
I'll use the FreeBSD kernel in this example, but all of this is similar if not identical in Linux. I will jump to FreeBSD's sosetopt() function in sys/kern/uipc_socket.c.
The only valid options you may set are:
SO_ACCEPTFILTER, SO_LINGER, SO_DEBUG, SO_KEEPALIVE, SO_DONTROUTE, SO_USELOOPBACK, SO_BROADCAST, SO_REUSEADDR, SO_REUSEPORT, SO_REUSEPORT_LB, SO_OOBINLINE, SO_TIMESTAMP, SO_BINTIME, SO_NOSIGPIPE, SO_NO_DDP, SO_NO_OFFLOAD, SO_RERROR, SO_SETFIB, SO_USER_COOKIE, SO_SNDBUF, SO_RCVBUF, SO_SNDLOWAT, SO_RCVLOWAT, SO_SNDTIMEO, SO_RCVTIMEO, SO_LABEL, SO_TS_CLOCK, and SO_MAX_PACING_RATE.
That list contains a number of status flags that enable or disable features. Only a few options allow setting numerical values:
SO_USER_COOKIE - set a user-specified metadata value to a socket.
SO_SNDBUF/SO_RCVBUF - set the allocated buffer sizes for sending and receiving.
SO_SNDLOWAT/SO_RCVLOWAT - set a minimum amount of data to be sent/received per call.
SO_SNDTIMEO/SO_RCVTIMEO - set a timeout for sending/receiving calls.
SO_MAX_PACING_RATE - Instructs the network adapter to limit the transfer rate.
None of these write values directly to kernel structures. To accomplish what you have requested, you will need to modify the kernel. Your other question addresses that objective.
What are the ways of setting custom baudrates on Linux?
An answer to this question must be at a level of userland low-level APIs (ioctl, etc.) above the level of a syscall. It should be useful in these circumstances at least:
Writing low-level C-based userland code that uses serial ports,
Writing libraries that abstract the serial port functionality,
Writing kernel serial port drivers.
Things are, unfortunately, driver-dependent. Good drivers will implement all of the methods below. Bad drivers will implement only some of the methods. Thus you need to try them all. All of the methods below are implemented in the helper functions in linux/drivers/tty/serial/serial_core.c.
The following 4 choices are available.
Standard baud rates are set in tty->termios->c_cflag. You can choose from:
B0
B50
B75
B110
B134
B150
B200
B300
B600
B1200
B1800
B2400
B4800
B9600
B19200
B38400
B57600
B115200
B230400
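From userland, the standard rates are normally selected through the POSIX termios helpers rather than by touching c_cflag directly. A minimal sketch, assuming a Linux termios.h that defines B115200 (the function name is hypothetical):

```c
#include <termios.h>

/* Store a standard rate constant (for example B115200) for both directions
 * in an in-memory termios structure. Returns 0 on success, -1 on failure. */
int set_standard_speed(struct termios *term, speed_t speed)
{
    if (cfsetispeed(term, speed) < 0)
        return -1;
    if (cfsetospeed(term, speed) < 0)
        return -1;
    return 0;
}
```

On a real port you would fill the structure with tcgetattr(fd, &term) first and apply it afterwards with tcsetattr(fd, TCSANOW, &term).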
If you need rates not listed above, e.g. 460800 (this is a deprecated hack that the kernel developers wish to die, per the source code comments):
set tty->termios->c_cflag speed to B38400
call TIOCSSERIAL ioctl with (struct serial_struct) set as follows:
(serial->flags & ASYNC_SPD_MASK) == ASYNC_SPD_[HI, VHI, SHI, WARP]
// this is an assertion, i.e. what your code must achieve, not how
This sets alternate speed to HI: 57600, VHI: 115200, SHI: 230400, WARP: 460800
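The flag-to-speed mapping just listed can be written down as a small lookup. This is only a sketch of the table, not the driver code; it assumes the ASYNC_SPD_* constants that linux/serial.h exposes to userland, and the function name is made up.

```c
#include <linux/serial.h>  /* pulls in the ASYNC_SPD_* flag definitions */

/* Return the rate that B38400 is remapped to for the given serial_struct
 * flags, or 0 when the flags select no fixed alternate speed. */
int alt_speed_for_flags(int flags)
{
    switch (flags & ASYNC_SPD_MASK) {
    case ASYNC_SPD_HI:   return 57600;
    case ASYNC_SPD_VHI:  return 115200;
    case ASYNC_SPD_SHI:  return 230400;
    case ASYNC_SPD_WARP: return 460800;
    default:             return 0;  /* none set, or ASYNC_SPD_CUST */
    }
}
```

Masking first matters because the flags word carries unrelated bits as well.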
You can set an arbitrary speed using alt_speed as follows:
Set tty->termios->c_cflag speed to B38400. This is unrelated to the speed you chose!
Set the intended speed in tty->alt_speed. It gets ignored when alt_speed==0.
You can also set an arbitrary speed by setting a custom divisor as follows:
Set tty->termios->c_cflag speed to B38400. This is unrelated to the speed you chose!
#include <stdbool.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <linux/serial.h>
bool set_baudrate(int fd, long baudrate) {
struct termios term;
if (tcgetattr(fd, &term)) return false;
term.c_cflag &= ~(CBAUD | CBAUDEX);
term.c_cflag |= B38400;
if (tcsetattr(fd, TCSANOW, &term)) return false;
// cont'd below
Call TIOCSSERIAL ioctl with struct serial_struct set as follows:
(serial->flags & ASYNC_SPD_MASK) == ASYNC_SPD_CUST
serial->custom_divisor == serial->baud_base / your_new_baudrate
// these are assertions, i.e. what your code must achieve, not how
How to do it? First get the structure filled (including baud_base you need) by calling TIOCGSERIAL ioctl. Then modify it to indicate the new baudrate and set it with TIOCSSERIAL:
// cont'd
struct serial_struct serial;
if (ioctl(fd, TIOCGSERIAL, &serial)) return false;
serial.flags &= ~ASYNC_SPD_MASK;
serial.flags |= ASYNC_SPD_CUST;
serial.custom_divisor = serial.baud_base / baudrate;
if (ioctl(fd, TIOCSSERIAL, &serial)) return false;
return true;
}
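One thing to watch with the divisor: integer division truncates, so the rate you get can differ from the rate you asked for, and a request above baud_base yields a divisor of 0. The sketch below rounds to the nearest divisor instead and reports the resulting rate. Rounding is my own choice here, not something the drivers require, and both function names are hypothetical.

```c
/* Nearest-integer divisor for the requested rate.
 * Returns 0 when no usable divisor exists. */
int nearest_divisor(int baud_base, int baudrate)
{
    if (baud_base <= 0 || baudrate <= 0)
        return 0;
    return (baud_base + baudrate / 2) / baudrate;
}

/* Rate the hardware will actually produce for a given divisor. */
int actual_baudrate(int baud_base, int divisor)
{
    return divisor > 0 ? baud_base / divisor : 0;
}
```

For example, with a baud_base of 115200 and a requested rate of 31250, truncation gives a divisor of 3 (38400 baud) while rounding gives 4 (28800 baud). Neither is close, so it is worth comparing actual_baudrate() against the request before relying on the port.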
I'm writing a netfilter module that does deep packet inspection. However, during tests I found that the module is not receiving the packet in full.
To verify this, I wrote the following code to dump packets received on port 80 and write the result to the dmesg buffer:
const struct iphdr *ip_header = ip_hdr(skb);
if (ip_header->protocol == IPPROTO_TCP)
{
const struct tcphdr *tcp_header = tcp_hdr(skb);
if (ntohs(tcp_header->dest) != 80)
{
return NF_ACCEPT;
}
buff = (char *)kzalloc(skb->len * 10, GFP_KERNEL);
if (buff != NULL)
{
int pos = 0, i = 0;
for (i = 0; i < skb->len; i ++)
{
pos += sprintf(buff + pos, "%02X", skb->data[i] & 0xFF);
}
pr_info("(%pI4):%d --> (%pI4):%d, len=%d, data=%s\n",
&ip_header->saddr,
ntohs(tcp_header->source),
&ip_header->daddr,
ntohs(tcp_header->dest),
skb->len,
buff
);
kfree (buff);
}
}
In a virtual machine running locally, I can retrieve the full HTTP request. On Alibaba Cloud and some other OpenStack-based VPS providers, the packet is cut off in the middle.
To verify this, I executed curl http://VPS_IP on another VPS and got the following output in the dmesg buffer:
[ 1163.370483] (XXXX):5007 --> (XXXX):80, len=237, data=451600ED000040003106E3983D87A950AC11D273138F00505A468086B44CE19E80180804269300000101080A1D07500A000D2D90474554202F20485454502F312E310D0A486F73743A2033392E3130372E32342E37370D0A4163636570743A202A2F2A0D0A557365722D4167656E743A204D012000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000001E798090F5FFFF8C0000007B00000000E0678090F5FFFF823000003E00000040AE798090F5FFFF8C0000003E000000000000000000000000000000000000000000000000000000000000
When decoded, the result is like this
It's totally weird: everything after User-Agent: M is "gone" or zeroed. Although skb->len is 237, half of the packet is missing.
Any ideas? I tried both PRE_ROUTING and LOCAL_IN, with no change.
It appears that sometimes you are getting a linear skb, and sometimes your skb is not linear. In the latter case you are not reading the full data contents of an skb.
If skb->data_len is zero, then your skb is linear and the full data contents of the skb are in skb->data. If skb->data_len is not zero, then your skb is not linear, and skb->data contains just the first (linear) part of the data. The length of this area is skb->len - skb->data_len. The skb_headlen() helper function calculates that for convenience. The skb_is_nonlinear() helper function tells whether an skb is linear or not.
The rest of the data can be in paged fragments, and in skb fragments, in this order.
skb_shinfo(skb)->nr_frags tells the number of paged fragments. Each paged fragment is described by a data structure in the array of structures skb_shinfo(skb)->frags[0..skb_shinfo(skb)->nr_frags - 1]. The skb_frag_size() and skb_frag_address() helper functions help with this data. They accept the address of the structure that describes a paged fragment. There are other useful helper functions depending on your kernel version.
If the total size of data in paged fragments is less than skb->data_len, then the rest of the data is in skb fragments. This is the list of skbs attached to this skb at skb_shinfo(skb)->frag_list (see skb_walk_frags() in the kernel).
Note that there may be no data in the linear part and/or no data in the paged fragments. You just need to process the data piece by piece in the order just described.
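The order of traversal can be illustrated with a small userland model. To be clear about the assumptions: struct model_skb and struct model_frag below are hypothetical stand-ins I made up so the walk can be compiled and run outside the kernel; they are not the real sk_buff layout. In a real module, the kernel's own skb_copy_bits() helper performs this kind of copy across all three areas.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userland stand-ins for the sk_buff fields discussed above.
 * These are NOT the kernel structures. */
struct model_frag {
    const unsigned char *addr;          /* what skb_frag_address() would return */
    size_t size;                        /* what skb_frag_size() would return */
};

struct model_skb {
    const unsigned char *data;          /* linear part */
    size_t len;                         /* total bytes in all parts */
    size_t data_len;                    /* bytes outside the linear part */
    size_t nr_frags;                    /* number of paged fragments */
    const struct model_frag *frags;     /* paged fragments */
    const struct model_skb *frag_list;  /* first skb of the fragment list */
    const struct model_skb *next;       /* next skb within a fragment list */
};

/* Append at most the remaining capacity; returns the new write position. */
static size_t append_bytes(unsigned char *out, size_t pos, size_t cap,
                           const unsigned char *src, size_t n)
{
    if (n > cap - pos)
        n = cap - pos;
    if (n > 0)
        memcpy(out + pos, src, n);
    return pos + n;
}

/* Copy up to `cap` bytes in order: linear part, paged fragments, then every
 * skb on the fragment list. Returns the number of bytes copied. */
size_t model_skb_gather(const struct model_skb *skb, unsigned char *out, size_t cap)
{
    size_t headlen = skb->len - skb->data_len;  /* same value as skb_headlen() */
    size_t pos = append_bytes(out, 0, cap, skb->data, headlen);
    const struct model_skb *child;
    size_t i;

    for (i = 0; i < skb->nr_frags; i++)
        pos = append_bytes(out, pos, cap, skb->frags[i].addr, skb->frags[i].size);

    for (child = skb->frag_list; child != NULL; child = child->next)
        pos += model_skb_gather(child, out + pos, cap - pos);

    return pos;
}
```

The dump loop in the question only covers the first of these three areas, which would explain why the tail of the request looks like garbage on hosts that deliver non-linear skbs.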
By atomic I mean that the write either succeeds completely or fails and does nothing.
I know socketpair(AF_LOCAL, SOCK_STREAM) is not atomic: if multiple processes/threads call write(fd, buf, len), the return value of write() may be > 0 && < len, which causes data to arrive out of order.
If multiple processes/threads call write(fd, buf, len) on a socket fd created by socketpair(AF_LOCAL, SOCK_SEQPACKET), is it atomic?
I checked the Linux manual and found something about pipe(), which says that if len is less than PIPE_BUF, write/writev is atomic.
I found nothing about socketpair. I wrote some test code and it seems that SOCK_SEQPACKET is atomic: I write buffers of random length to the fd and the return value is always -1 or len.
Yes.
Any interface that is datagram based (i.e. - the size you pass to write is visible to the person doing the read) must be atomic. There is no other way to guarantee that property.
So SOCK_SEQPACKET, as well as SOCK_DGRAM, must be atomic in order to function.
For that very same reason, SOCK_STREAM has no such atomicity guarantees.
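The record-boundary property is easy to observe. The sketch below is a minimal single-process check, assuming Linux (AF_UNIX is the same family as AF_LOCAL); the function name is made up. It writes two records back to back and reads with a buffer large enough to hold both.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send two records over a SOCK_SEQPACKET pair and store the size returned by
 * each read() in sizes[0..1]. Returns 0 on success, -1 on any failure. */
int seqpacket_record_sizes(ssize_t sizes[2])
{
    int sv[2];
    char buf[64];
    int rc = -1;

    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
        return -1;

    if (write(sv[0], "abc", 3) == 3 && write(sv[0], "defgh", 5) == 5) {
        /* Each read returns exactly one record, never a merged one. */
        sizes[0] = read(sv[1], buf, sizeof(buf));
        sizes[1] = read(sv[1], buf, sizeof(buf));
        rc = 0;
    }

    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

With SOCK_STREAM the first read could return all 8 bytes at once. This does not exercise concurrent writers, so treat it as a demonstration of the boundary semantics rather than a proof of atomicity.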
I recently ran into a stack smashing (= buffer overflow) problem when trying to run iperf3. I pinpointed the reason to the getsockname() call (https://github.com/esnet/iperf/blob/master/src/net.c#L463) that makes the kernel copy more data (sizeof(sin_addr)) at the designed address (&sa) than the size of the variable on the stack at that address.
getsockname() redirects the call to getname() (AF_INET family) :
https://github.com/torvalds/linux/blob/master/net/ipv4/af_inet.c#L698
If I believe the manpage (ubuntu) it says:
int getsockname(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
The addrlen argument should be initialized to indicate the amount of space (in bytes) pointed to by addr. On return it contains the actual size of the socket address.
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
But in the previous code excerpt, getname() does not care about the addrlen input value and uses the parameter as an output value only.
I had found a link (I can't find it anymore) saying that BSD respects the man page excerpt above, unlike Linux.
Am I missing something? I find it odd that the documentation would be that far off. I've checked other Linux XXX_getname calls, and none of the ones I saw cared about the input length.
Short answer
I believe that the addrlen value is not checked in the kernel simply to avoid wasting CPU cycles: the address should always be of a known type (e.g. struct sockaddr), so it should always have a known, fixed size (which is 16 bytes). So the kernel just overwrites addrlen with 16, no matter what.
Regarding the issue you are having: I'm not sure why it's happening, but it doesn't actually seem that it's about size mismatch. I'm pretty sure kernel and userspace both have the same size of that structure which should be passed to getsockname() syscall (proof is below). So basically the situation you are describing here:
...that makes the kernel copy more data (sizeof(sin_addr)) at the designed address (&sa) than the size of the variable on the stack at that address
is not the case. I can only imagine how many applications would fail if it were true.
Detailed explanation
Userspace side
On the userspace side, the sockaddr struct that iperf uses is defined as follows (/usr/include/bits/socket.h):
/* Structure describing a generic socket address. */
struct sockaddr
{
__SOCKADDR_COMMON (sa_); /* Common data: address family and length. */
char sa_data[14]; /* Address data. */
};
And the __SOCKADDR_COMMON macro is defined as follows (/usr/include/bits/sockaddr.h):
/* This macro is used to declare the initial common members
of the data types used for socket addresses, `struct sockaddr',
`struct sockaddr_in', `struct sockaddr_un', etc. */
#define __SOCKADDR_COMMON(sa_prefix) \
sa_family_t sa_prefix##family
And sa_family_t is defined as:
/* POSIX.1g specifies this type name for the `sa_family' member. */
typedef unsigned short int sa_family_t;
So basically sizeof(struct sockaddr) is always 16 bytes (= sizeof(char[14]) + sizeof(short)).
Kernel side
In the inet_getname() function you can see that the addrlen param is overwritten with the following value:
*uaddr_len = sizeof(*sin);
where sin is:
DECLARE_SOCKADDR(struct sockaddr_in *, sin, uaddr);
So you see that sin has type of struct sockaddr_in *. This structure is defined as follows (include/uapi/linux/in.h):
/* Structure describing an Internet (IP) socket address. */
#define __SOCK_SIZE__ 16 /* sizeof(struct sockaddr) */
struct sockaddr_in {
__kernel_sa_family_t sin_family; /* Address family */
__be16 sin_port; /* Port number */
struct in_addr sin_addr; /* Internet address */
/* Pad to size of `struct sockaddr'. */
unsigned char __pad[__SOCK_SIZE__ - sizeof(short int) -
sizeof(unsigned short int) - sizeof(struct in_addr)];
};
So the structure that sin points to is also 16 bytes long.
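The effect can be observed from userspace. The sketch below is a rough probe, assuming Linux; the function name is made up. It passes a deliberately short addrlen for an AF_INET socket and checks two things: what length comes back, and whether the bytes beyond the given length were touched. As far as I can tell from net/socket.c, the copy to userspace goes through move_addr_to_user(), which clamps the copy to the length the caller supplied, so the family-specific getname function does not need to check it.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Call getsockname() with a short addrlen on a fresh AF_INET socket.
 * Stores the length reported back in *reported. Returns 1 if the bytes past
 * `given` were left untouched, 0 if they were overwritten, -1 on error. */
int probe_short_addrlen(socklen_t given, socklen_t *reported)
{
    struct sockaddr_in sa;
    const unsigned char *bytes = (const unsigned char *)&sa;
    socklen_t len = given;
    int intact = 1;
    size_t i;
    int fd;

    if (given > sizeof(sa))
        return -1;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&sa, 0xAA, sizeof(sa));  /* canary pattern */
    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    for (i = given; i < sizeof(sa); i++)
        if (bytes[i] != 0xAA)
            intact = 0;

    *reported = len;
    return intact;
}
```

On the systems I would expect this to run on, a call with given = 4 reports sizeof(struct sockaddr_in) back and leaves the canary bytes intact, which matches the man page wording about truncation.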
UPDATE
I'll try to reply to your comment:
if getsockname wants to allocate an ipv6 instead that may be why it overflows the buffer
When calling getsockname() for an AF_INET6 socket, the kernel will figure out (in the getsockname() syscall, via the sockfd_lookup_light() function) that inet6_getname() should be called to handle your request. In that case, uaddr_len will be assigned the following value:
struct sockaddr_in6 *sin = (struct sockaddr_in6 *)uaddr;
...
*uaddr_len = sizeof(*sin);
So if you are using sockaddr_in6 struct in your user-space program too, the size will be the same. Of course, if your userspace application is passing sockaddr structure to getsockname for AF_INET6 socket, there will be some sort of overflow (because sizeof(struct sockaddr_in6) > sizeof(struct sockaddr)). But I believe it's not the case for iperf3 tool you are using. And if it is -- it's iperf that should be fixed in the first place, and not the kernel.
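If the caller does not know in advance whether the socket is IPv4 or IPv6, the usual way to sidestep the size question is struct sockaddr_storage, which is large enough for any address family. A minimal sketch (the function name is hypothetical, and I am not claiming this is what iperf3 does):

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Fetch the local address into a buffer big enough for any family.
 * Returns the address family on success, or -1 on failure or if the kernel
 * reports that the address did not fit. */
int local_family(int fd, struct sockaddr_storage *ss, socklen_t *len)
{
    *len = sizeof(*ss);
    if (getsockname(fd, (struct sockaddr *)ss, len) < 0)
        return -1;
    if (*len > sizeof(*ss))
        return -1;  /* truncated; should not happen with sockaddr_storage */
    return ss->ss_family;
}
```

After the call, the family tells you whether to cast the buffer to struct sockaddr_in or struct sockaddr_in6.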
I read this guide to write a kernel module to do simple network filtering.
First, I don't understand what the text below means. What is the difference between inbound and outbound data packets (at the transport layer)?
When a packet goes in from wire, it travels from physical layer, data
link layer, network layer upwards, therefore it might not go through
the functions defined in netfilter for skb_transport_header to work.
Second, I hate magic numbers, and I want to replace the 20 (the length of a typical IP header) with a function or macro from the Linux kernel sources.
Any help will be appreciated.
This article is a little outdated now. The text that you don't understand applies only to kernel versions below 3.11.
For new kernels (>= 3.11)
If you are sure that your code will only be used with kernels >= 3.11, you can use the following code for both input and output packets:
udp_header = (struct udphdr *)skb_transport_header(skb);
Or, more elegantly:
udp_header = udp_hdr(skb);
This works because the transport header is already set up for you in ip_rcv():
skb->transport_header = skb->network_header + iph->ihl*4;
This change was brought by this commit.
For old kernels (< 3.11)
Outgoing packets (NF_INET_POST_ROUTING)
In this case the .transport_header field is set up correctly in the sk_buff, so it points to the actual transport layer header (UDP/TCP). So you can use code like this:
udp_header = (struct udphdr *)skb_transport_header(skb);
or better looking (but actually the same):
udp_header = udp_hdr(skb);
Incoming packets (NF_INET_PRE_ROUTING)
This is the tricky part.
In this case the .transport_header field is not set to the actual transport layer header (UDP or TCP) in the sk_buff structure (that you get in your netfilter hook function). Instead, .transport_header points to the IP header (which is the network layer header).
So you need to calculate the address of the transport header on your own. To do so you need to skip the IP header (i.e. add the IP header length to your .transport_header address). That's why you see the following code in the article:
udp_header = (struct udphdr *)(skb_transport_header(skb) + 20);
So 20 here is just the length of IP header.
It can be done more elegantly this way:
struct iphdr *iph;
struct udphdr *udph;
iph = ip_hdr(skb);
/* If transport header is not set for this kernel version */
if (skb_transport_header(skb) == (unsigned char *)iph)
udph = (struct udphdr *)((unsigned char *)iph + (iph->ihl * 4)); /* skip IP header */
else
udph = udp_hdr(skb);
This code uses the actual IP header size (iph->ihl * 4, in bytes) instead of the magic number 20.
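The same arithmetic can be tried outside the kernel on a raw packet buffer. The sketch below is a userland illustration only, with made-up function names: the IHL field is the low nibble of the first byte and counts 32-bit words, so multiplying by 4 gives the header length in bytes, and the destination port sits at the same offset in both TCP and UDP headers.

```c
#include <stddef.h>
#include <stdint.h>

#define IPV4_MIN_HEADER_LEN 20  /* header without options: IHL == 5 */

/* Length in bytes of the IPv4 header at the start of pkt, which is also the
 * offset of the transport header. Returns 0 if the buffer is not a plausible
 * IPv4 packet. */
size_t ipv4_header_len(const uint8_t *pkt, size_t len)
{
    size_t ihl_bytes;

    if (len < IPV4_MIN_HEADER_LEN || (pkt[0] >> 4) != 4)
        return 0;
    ihl_bytes = (size_t)(pkt[0] & 0x0F) * 4;
    if (ihl_bytes < IPV4_MIN_HEADER_LEN || ihl_bytes > len)
        return 0;
    return ihl_bytes;
}

/* Destination port from the TCP or UDP header that follows the IP header.
 * Returns 0 if it is not available. */
uint16_t transport_dest_port(const uint8_t *pkt, size_t len)
{
    size_t off = ipv4_header_len(pkt, len);

    if (off == 0 || len < off + 4)
        return 0;
    return (uint16_t)((pkt[off + 2] << 8) | pkt[off + 3]);
}
```

A header that carries IP options has an IHL greater than 5, which is exactly the case where a hard-coded 20 would point into the middle of the IP header.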
Another magic number in the article is 17 in the following code:
if (ip_header->protocol == 17) {
In this code you should use IPPROTO_UDP instead of 17:
#include <linux/udp.h>
if (ip_header->protocol == IPPROTO_UDP) {
Netfilter input/output packets explanation
If you need a reference on the difference between incoming and outgoing packets in netfilter, see the picture below.
Details:
[1]: Some useful code from GitHub
[2]: "Linux Kernel Networking: Implementation and Theory" by Rami Rosen
[3]: This answer may be also useful