For a UDP socket, sendto() on an unbound socket makes the kernel bind an ephemeral port automatically. For a Unix domain socket of datagram type there is no port concept, only a path address. Does sendto() come up with a random path (which would need to be backed by a real file in the filesystem), or a random abstract path (such as '#blah'), and then bind to it?
I am asking because on my machine I see these datagram Unix socket pairs in the 'ESTAB' state, and I wonder how these endpoints are identified if the address here is '*', which I guess means an empty string:
# ss -xp | grep dev-log
u_dgr ESTAB 0 0 /run/systemd/journal/dev-log 15236 * 0 users:(("systemd-journal",pid=254,fd=3),("systemd",pid=1,fd=36))
# ss -xp | grep 15236
u_dgr ESTAB 0 0 /run/systemd/journal/dev-log 15236 * 0 users:(("systemd-journal",pid=254,fd=3),("systemd",pid=1,fd=36))
u_dgr ESTAB 0 0 * 19250 * 15236 users:(("dbus-daemon",pid=369,fd=14))
u_dgr ESTAB 0 0 * 21686 * 15236 users:(("dbus-daemon",pid=701,fd=10))
A related question: what do the numbers that appear in place of port numbers mean in the Unix domain socket world?
I was under the impression that under Linux you could bind to a non-local address as long as you set the IP_FREEBIND socket option, but that's not the behavior I'm seeing:
$ sudo strace -e 'trace=%network' ...
...
socket(AF_INET, SOCK_RAW, IPPROTO_UDP) = 5
setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(5, SOL_SOCKET, SO_NO_CHECK, [1], 4) = 0
setsockopt(5, SOL_IP, IP_HDRINCL, [1], 4) = 0
setsockopt(5, SOL_IP, IP_FREEBIND, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(abcd), sin_addr=inet_addr("w.x.y.z")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
...
I also set the ip_nonlocal_bind sysctl, just to be certain, and I get the same result:
$ sysctl net.ipv4.ip_nonlocal_bind
net.ipv4.ip_nonlocal_bind = 1
Unfortunately, it seems that it is not possible to bind a raw IP socket to a non-local, non-broadcast and non-multicast address, regardless of IP_FREEBIND. Since I see inet_addr("w.x.y.z") in your strace output, I assume that this is exactly what you're trying to do and w.x.y.z is a non-local unicast address, thus your bind syscall fails.
This seems in accordance with man 7 raw:
A raw socket can be bound to a specific local address using the
bind(2) call. If it isn't bound, all packets with the specified
IP protocol are received. In addition, a raw socket can be bound
to a specific network device using SO_BINDTODEVICE; see socket(7).
Indeed, looking at the kernel source code, in raw_bind() we can see the following check:
ret = -EADDRNOTAVAIL;
if (addr->sin_addr.s_addr && chk_addr_ret != RTN_LOCAL &&
chk_addr_ret != RTN_MULTICAST && chk_addr_ret != RTN_BROADCAST)
goto out;
Also, note that .sin_port must be 0. The .sin_port field for raw sockets is used to select a sending/receiving IP protocol (not a port, since we are at layer 3 and ports do not exist). As the manual states, from Linux 2.2 onwards you can no longer select a sending protocol through .sin_port; the sending protocol is the one set when creating the socket.
Development setup:
AMD 3700X on a B450 motherboard
2 x intel T210 1Gb NICs (one port each, connected to one another)
Ubuntu 20.04
linux kernel 5.6.19-050619-generic
DPDK version stable-20.11.3
$ sudo usertools/dpdk-devbind.py --status
Network devices using DPDK-compatible driver
============================================
0000:06:00.0 'I210 Gigabit Network Connection 1533' drv=uio_pci_generic unused=vfio-pci
Network devices using kernel driver
===================================
0000:04:00.0 'I210 Gigabit Network Connection 1533' if=enigb1 drv=igb unused=vfio-pci,uio_pci_generic *Active*
0000:05:00.0 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller 8168' if=enp5s0 drv=r8169 unused=vfio-pci,uio_pci_generic *Active*
One NIC (0000:04:00.0) provides traffic (88 packets captured from a live device) in Linux mode (nothing to do with DPDK):
$ sudo tcpreplay -q -p 1000 -l 5 -i enigb1 random_capture_realtek_dec_14.pcapng 2>&1 | grep -v "PF_PACKET\|Warning"
Actual: 440 packets (96555 bytes) sent in 0.440001 seconds
Rated: 219442.6 Bps, 1.75 Mbps, 999.99 pps
Statistics for network device: enigb1
Successful packets: 440
Failed packets: 5
Truncated packets: 0
Retried packets (ENOBUFS): 0
Retried packets (EAGAIN): 0
The other NIC (0000:06:00.0) has dpdk-testpmd running on it:
sudo ./build/app/dpdk-testpmd -c 0xf000 -n 2 --huge-dir=/mnt/huge-2M -a 06:00.0 -- --portmask=0x3
EAL: Detected 16 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: 3 hugepages of size 1073741824 reserved, but no mounted hugetlbfs found for that size
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Probe PCI driver: net_e1000_igb (8086:1533) device: 0000:06:00.0 (socket 0)
EAL: No legacy callbacks, legacy socket not created
testpmd: create a new mbuf pool <mb_pool_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.
Configuring Port 0 (socket 0)
Port 0: 68:05:CA:E3:05:A2
Checking link statuses...
Done
No commandline core given, start packet forwarding
io packet forwarding - ports=1 - cores=1 - streams=1 - NUMA support enabled, MP allocation mode: native
Logical Core 13 (socket 0) forwards packets on 1 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
io packet forwarding packets/burst=32
nb forwarding cores=1 - nb forwarding ports=1
port 0: RX queue number: 1 Tx queue number: 1
Rx offloads=0x0 Tx offloads=0x0
RX queue: 0
RX desc=512 - RX free threshold=32
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x0
TX queue: 0
TX desc=512 - TX free threshold=0
TX threshold registers: pthresh=8 hthresh=1 wthresh=16
TX offloads=0x0 - TX RS bit threshold=0
Press enter to exit
Port 0: link state change event
Telling cores to stop...
Waiting for lcores to finish...
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 153 RX-dropped: 287 RX-total: 440
TX-packets: 0 TX-dropped: 0 TX-total: 0
----------------------------------------------------------------------------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 153 RX-dropped: 287 RX-total: 440
TX-packets: 0 TX-dropped: 0 TX-total: 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Done.
No matter how many packets I pump through the pipeline, at most 153 packets are processed; the rest are dropped and nothing is sent. I'd guess those 153 packets clog the TX queue on 0000:06:00.0 and that's it.
I tried dpdk-pktgen and it sent nothing either.
Note: the NIC used to send the packets (tcpreplay) is the same model as the one used for dpdk-testpmd.
What am I doing wrong, or what am I not doing (and I should)?
Update 1:
testpmd> set promisc all on
testpmd> start tx_first
io packet forwarding - ports=1 - cores=1 - streams=1 - NUMA support enabled, MP allocation mode: native
(...)
testpmd> show port xstats 0
(...)
rx_good_packets: 153
tx_good_packets: 0
rx_good_bytes: 31003
tx_good_bytes: 0
rx_missed_errors: 287
(...)
rx_total_packets: 440
tx_total_packets: 285
rx_total_bytes: 97035
tx_total_bytes: 17100
tx_size_64_packets: 0
tx_size_65_to_127_packets: 0
tx_size_128_to_255_packets: 0
tx_size_256_to_511_packets: 0
tx_size_512_to_1023_packets: 0
tx_size_1023_to_max_packets: 0
testpmd> show fwd stats all
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 153 RX-dropped: 287 RX-total: 440
TX-packets: 0 TX-dropped: 0 TX-total: 0
----------------------------------------------------------------------------
For example, in /proc/net/sockstat, does a TCP socket in CLOSE_WAIT get counted as 'inuse' or 'alloc'?
In the kernel source net/ipv4/proc.c I see that sockstat_seq_show is called when getting the info from /proc/net/sockstat.
However, I cannot see what differentiates a socket that is merely allocated ('alloc') from one that is 'inuse':
[me@myhostname ~]$ cat /proc/net/sockstat
sockets: used 481
TCP: inuse 52 orphan 1 tw 66 alloc 62 mem 12
UDP: inuse 11 mem 5
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
In net/tcp_states.h the possible states are enumerated as follows:
enum {
TCP_ESTABLISHED = 1,
TCP_SYN_SENT,
TCP_SYN_RECV,
TCP_FIN_WAIT1,
TCP_FIN_WAIT2,
TCP_TIME_WAIT,
TCP_CLOSE,
TCP_CLOSE_WAIT,
TCP_LAST_ACK,
TCP_LISTEN,
TCP_CLOSING, /* Now a valid state */
TCP_NEW_SYN_RECV,
TCP_MAX_STATES /* Leave at the end! */
};
Which of the above count as 'inuse' and which count as 'alloc' ?
Which of the above count as 'inuse' and which count as 'alloc' ?
You already got close to the answer by locating sockstat_seq_show: we can see that 'inuse' is the value of sock_prot_inuse_get(net, &tcp_prot), and 'alloc' is the value of proto_sockets_allocated_sum_positive(&tcp_prot). It's not always easy to follow the call chain further down, but, if I'm not mistaken, I arrive at the following conclusions.
'alloc' - At bottom this is the sum of the percpu_counter tcp_sockets_allocated, which gets incremented in tcp_init_sock(); there the socket state is initialized to TCP_CLOSE. 'alloc' does not depend on whatever state changes the socket undergoes during its existence: all TCP states count as 'alloc'.
'inuse' - This is the sum of the per-CPU counters net->core.inuse or prot_inuse (for TCP in this case), which essentially get incremented by sock_prot_inuse_add(…, 1) in inet_hash() and decremented by sock_prot_inuse_add(…, -1) in inet_unhash(). The condition in inet_hash() is if (sk->sk_state != TCP_CLOSE), so all TCP states except TCP_CLOSE count as 'inuse'.
I think this means that in theory any socket in a state >= TCP_CLOSE is not counted as 'inuse'.
In my view that can't be so, since TCP_LISTEN > TCP_CLOSE as well, and a socket in the TCP_LISTEN state surely is counted as 'inuse', as can be seen with e.g.
(cd /proc/net; cat sockstat; nc -l 8888& sleep 1; cat sockstat; kill $!; cat sockstat)|grep TCP
I'm trying to set TCP keepalive, but in doing so I'm seeing the error "Protocol not available":
int rc = setsockopt(s, SOL_SOCKET, TCP_KEEPIDLE, &keepalive_idle, sizeof(keepalive_idle));
if (rc < 0)
printf("error setting keepalive_idle: %s\n", strerror(errno));
I'm able to turn on keepalive and set the keepalive interval and count, but keepalive idle (which is the keepalive time) throws that error, and I never see any keepalive packets being transmitted or received, either with Wireshark and the filter tcp.analysis.keep_alive or with tcpdump:
sudo tcpdump -vv "tcp[tcpflags] == tcp-ack and less 1"
Is there a kernel module that needs to be loaded or something? Or are you no longer able to override the global KEEPIDLE time?
By the way, here is the output of sysctl:
matt@devpc:~$ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
In an application that I coded, the following works:
setsockopt(*sfd, SOL_SOCKET, SO_KEEPALIVE,(char *)&enable_keepalive, sizeof(enable_keepalive));
setsockopt(*sfd, IPPROTO_TCP, TCP_KEEPCNT, (char *)&num_keepalive_strobes, sizeof(num_keepalive_strobes));
setsockopt(*sfd, IPPROTO_TCP, TCP_KEEPIDLE, (char *)&keepalive_idle_time_secs, sizeof(keepalive_idle_time_secs));
setsockopt(*sfd, IPPROTO_TCP, TCP_KEEPINTVL, (char *)&keepalive_strobe_interval_secs, sizeof(keepalive_strobe_interval_secs));
Try changing SOL_SOCKET to IPPROTO_TCP for TCP_KEEPIDLE.
There is a very handy lib that can help you, called libkeepalive: http://libkeepalive.sourceforge.net/
It can be used with LD_PRELOAD in order to enable and control keep-alive on all TCP sockets. You can also override keep-alive settings with environment variables.
I tried to run a tcp server with it:
KEEPIDLE=5 KEEPINTVL=5 KEEPCNT=100 LD_PRELOAD=/usr/lib/libkeepalive.so nc -l -p 4242
Then I connected a client:
nc 127.0.0.1 4242
And I visualized the traffic with Wireshark: the keep-alive packets began exactly after 5 seconds of inactivity (my system-wide setting is 75). Therefore it's possible to override the system settings.
Here is how libkeepalive sets TCP_KEEPIDLE:
if((env = getenv("KEEPIDLE")) && ((optval = atoi(env)) >= 0)) {
setsockopt(s, SOL_TCP, TCP_KEEPIDLE, &optval, sizeof(optval));
}
Looks like they use SOL_TCP instead of SOL_SOCKET (on Linux, SOL_TCP and IPPROTO_TCP have the same value, so either works).
How can I get Haskell to listen for UDP and TCP on the same port?
Here is the code I have so far (based on acme-http):
listenOn portm = do
protoTCP <- getProtocolNumber "tcp"
E.bracketOnError
(socket AF_INET Stream protoTCP)
sClose
(\sock -> do
setSocketOption sock ReuseAddr 1
setSocketOption sock NoDelay 1
bindSocket sock (SockAddrInet (fromIntegral portm) iNADDR_ANY)
listen sock (max 1024 maxListenQueue)
return sock
)
protoUDP <- getProtocolNumber "udp"
E.bracketOnError
(socket AF_INET Datagram protoUDP)
sClose
(\sock -> do
setSocketOption sock ReuseAddr 1
bindSocket sock (SockAddrInet (fromIntegral portm) iNADDR_ANY)
return sock
)
It compiles fine, but I get the following runtime error:
user error (accept: can't perform accept on socket ((AF_INET,Datagram,17)) in status Bound)
Unfortunately, documentation on network programming in Haskell is a bit limited (as usual). I don't really know where I'm supposed to look to figure this stuff out.
[UPDATE]
For anyone who is interested, here is the result:
https://github.com/joehillen/acme-sip/blob/master/Acme/Serve.hs
I realize there is a lot of room for improvement, but it works.
There doesn't seem to be anything wrong with this code, but somewhere else your code seems to be calling accept() on the UDP socket, which isn't legal. All you need to do with a UDP socket is receive from it and send with it.