TCP messages getting coalesced - linux

I have a Java application that is writing to the network. It is writing messages in the region of 764b, +/- 5b. A pcap shows that the stream is getting IP fragmented and we can't explain this.
Linux 2.6.18-238.1.1.el5
A strace shows:
(strace -vvvv -f -tt -o strace.out -e trace=network -p $PID)
1: 2045 12:48:23.984173 sendto(45, "\0\0\0\0\0\0\2\374\0\0\0\0\0\3\n\0\0\0\0\3upd\365myData"..., 764, 0, NULL, 0) = 764
2: 15206 12:48:23.984706 sendto(131, "\0\0\0\0\0\0\2\374\0\0\0\0\0\3\n\0\0\0\0\3upd\365myData"..., 764, 0, NULL, 0 <unfinished ...>
3: 2046 12:48:23.984811 sendto(46, "\0\0\0\0\0\0\2\374\0\0\0\0\0\3\n\0\0\0\0\3upd\365myData"..., 764, 0, NULL, 0 <unfinished ...>
4: 15206 12:48:23.984893 <... sendto resumed> ) = 764
5: 2046 12:48:23.984948 <... sendto resumed> ) = 764
I am seeing packets larger than the MTU when I capture the network, which is causing fragmentation.
4809 5.848987 10.0.0.2 -> 10.0.0.5 TCP 40656 > taiclock [ACK] Seq=325501 Ack=1 Win=46 Len=1448 TSV=344627654 TSER=270108068 # First Fragment
4810 5.848991 10.0.0.5 -> 10.0.0.2 TCP taiclock > 40656 [ACK] Seq=1 Ack=326949 Win=12287 Len=0 TSV=270108081 TSER=344627643 # TCP ack
4811 5.849037 10.0.0.2 -> 10.0.0.5 TCP 40656 > taiclock [PSH, ACK] Seq=326949 Ack=1 Win=46 Len=82 TSV=344627654 TSER=270108081 # Second Frag
Questions:
1) It appears the server trying to batch the two sendto() into one IP packet, which is larger than the MTU and is therefore getting fragmented. Why?
2) Looking at the strace output for PID 2046, is the figure after the equal sign <... sendto resumed> line a total for what was sent? I.e. 764b was sent in total for line 3 and line 5? Or is 764 bytes being sent per line?
3) Are there any options I can pass to strace to log all of the sendto() output? Can't seem to find anything..

To answer your questions, in order:
1) It is perfectly normal for multiple send calls to be coalesced when using TCP as it is a stream protocol so does not preserve user level send boundaries in any way. I don't see any evidence of IP fragmentation (which would be bad) in your trace, just of TCP segmentation (which is completely normal).
2) Yes, that is the size - more specifically it is reporting the value that the system call returned after it resumed.
3) You can use "-e write=all" or "-e write=" to get strace to report the whole of the written data.

Related

How to detect CDP by tcpdump

I would like to ask you for help: Does somebody know how to detect Cisco Discovery Protocol via tcpdump?
Currently I'm using following command, but I'm not sure by this:
tcpdump -i eth0 -nn "ether[20:2]==0x2000"
Some hints are appreciated. Thank you ...
Charkh
I normally use this filters
tcpdump -nvi bce0 -s 1500 ether dst 01:00:0c:cc:cc:cc
replace bce0 with your network interface.
This will output the hole CDP information, received from ether the switch or the host itself (if you have a cdpd running on the host)
This will output Switch-Name, Port, Switch Type, Software, VLAN and so on...
the output will look similar to this:
$tcpdump -nvi bce0 -s 1500 ether dst 01:00:0c:cc:cc:cc
tcpdump: WARNING: bce0: no IPv4 address assigned
tcpdump: listening on bce0, link-type EN10MB (Ethernet), capture size 1500 bytes
11:43:24.327197 DTPv1, length 39
Domain TLV (0x0001) TLV, length 18, domain-internal
Status TLV (0x0002) TLV, length 5, 0x81
DTP type TLV (0x0003) TLV, length 5, 0xa5
Neighbor TLV (0x0004) TLV, length 10, 6c:50:4d:06:64:01
11:43:44.820865 CDPv2, ttl: 180s, checksum: 692 (unverified), length 477
Device-ID (0x01), length: 40 bytes: 'my-switch.mydomain.net'
Version String (0x05), length: 247 bytes:
Cisco IOS Software, CBS30X0 Software (CBS30X0-IPBASEK9-M), Version 12.2(58)SE1, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2011 by Cisco Systems, Inc.
Compiled Thu 05-May-11 03:57 by prod_rel_team
Platform (0x06), length: 20 bytes: 'cisco WS-CBS3020-HPQ'
Address (0x02), length: 13 bytes: IPv4 (1) 1.2.3.4
Port-ID (0x03), length: 18 bytes: 'GigabitEthernet0/1'
Capability (0x04), length: 4 bytes: (0x00000028): L2 Switch, IGMP snooping
Protocol-Hello option (0x08), length: 32 bytes:
VTP Management Domain (0x09), length: 13 bytes: 'doman-internal'
Native VLAN ID (0x0a), length: 2 bytes: 358
Duplex (0x0b), length: 1 byte: full
AVVID trust bitmap (0x12), length: 1 byte: 0x00
AVVID untrusted ports CoS (0x13), length: 1 byte: 0x00
Management Addresses (0x16), length: 13 bytes: IPv4 (1) [IP]
unknown field type (0x1a), length: 12 bytes:
0x0000: 0000 0001 0000 0000 ffff ffff
I use the following command:
tcpdump -nn -v -xx -i eth? -s 1500 -c 1 'ether dst 01:00:0c:cc:cc:cc and (ether[24:2] = 0x2000 or ether[20:2] = 0x2000)'
Where eth? is your ethernet adapter.
It can be used with IBM SEA over trunked connections or over standard copper connections.

Linux Iptables string module does not match all packets

This is the first time I'm using matching string module for iptables and met some strange behaviour I can't overcome.
In short, it looks like iptables passes "trimmed" packets to module, so module can not parse whole packet for string.
I cant find any information about such behaviour on net. All examples and tutors should work like a charm, but they dont.
Now in details.
OS: debian testing, kernel 3.2.0-3-686-pae
IPTABLES: iptables v1.4.14
OTHER:
tcpdump version 4.3.0,
libpcap version 1.3.0
# lsmod|grep ipt
ipt_LOG 12533 0
iptable_nat 12800 0
nf_nat 17924 1 iptable_nat
nf_conntrack_ipv4 13726 3 nf_nat,iptable_nat
nf_conntrack 43121 3 nf_conntrack_ipv4,nf_nat,iptable_nat
iptable_filter 12488 0
ip_tables 17079 2 iptable_filter,iptable_nat
x_tables 18121 6
ip_tables,iptable_filter,iptable_nat,xt_string,xt_tcpudp,ipt_LOG
I'v reseted all rules to default and added only two rules in following order:
iptables -t filter -A OUTPUT --protocol tcp --dport 80 --match string --algo bm --from 0 --to 1500 --string "/index.php" --jump LOG --log-prefix "matched :"
iptables -t filter -A OUTPUT --protocol tcp --dport 80 --jump LOG --log-prefix "OUT : "
The idea is obious - match requests to any ip to port 80 containing string /index.php and print them to log, and also log all the data passed to 80 port.
So the iptables -L -xvn :
Chain INPUT (policy ACCEPT 3 packets, 693 bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 3 packets, 184 bytes)
pkts bytes target prot opt in out source destination
0 0 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 STRING match "/index.php" ALGO name bm TO 1500 LOG flags 0 level 4 prefix "matched :"
0 0 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 LOG flags 0 level 4 prefix "OUT : "
Counters for rules are zeroed.
Ok, now browser goes to www.gentoo.org/index.php. This url is just example for illustration.
So this is the only request url in browser.
And I get the following for iptables -t filter -L -xvn:
Chain INPUT (policy ACCEPT 61 packets, 16657 bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 63 packets, 4394 bytes)
pkts bytes target prot opt in out source destination
1 380 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 STRING match "/index.php" ALGO name bm TO 1500 LOG flags 0 level 4 prefix "matched :"
13 1392 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 LOG flags 0 level 4 prefix "OUT : "
So we have only ONE match for 1st rule. But that is wrong. Here is tcpdump for output for connection.
First ip request 89.16.167.134:
<handshake omitted>
16:56:38.440308 IP 192.168.66.106.54704 > 89.16.167.134.80: Flags [P.], seq 1:373, ack 1, win 913,
options [nop,nop,TS val 85359 ecr 3026115253], length 372
<...>
0x0030: b45e dab5 4745 5420 2f69 6e64 6578 2e70 .^..GET./index.p
0x0040: 6870 2048 5454 502f 312e 310d 0a48 6f73 hp.HTTP/1.1..Hos
0x0050: 743a 2077 7777 2e67 656e 746f 6f2e 6f72 t:.www.gentoo.or
0x0060: 670d 0a55 7365 722d 4167 656e 743a 204d g..User-Agent:.M
0x0070: 6f7a 696c 6c61 2f35 2e30 2028 5831 313b ozilla/5.0.(X11;
Here we see one match in HTTP GET. Ok some packets later we have request for some more content to ip 140.211.166.176:
16:56:38.772432 IP 192.168.66.106.59766 > 140.211.166.176.80: Flags [P.], seq 1:329, ack 1, win 913,
options [nop,nop,TS val 85442 ecr 110101513], length 328
<...>
0x0030: 0690 0409 4745 5420 2f20 4854 5450 2f31 ....GET./.HTTP/1
<...>
0x0130: 6566 6c61 7465 0d0a 436f 6e6e 6563 7469 eflate..Connecti
0x0140: 6f6e 3a20 6b65 6570 2d61 6c69 7665 0d0a on:.keep-alive..
0x0150: 5265 6665 7265 723a 2068 7474 703a 2f2f Referer:.http://
0x0160: 7777 772e 6765 6e74 6f6f 2e6f 7267 2f69 www.gentoo.org/i
0x0170: 6e64 6578 2e70 6870 0d0a 0d0a ndex.php....
Here we see "/index.php" again.
But LOG rule gives the following info:
kernel: [ 641.386182] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=89.16.167.134 LEN=60
kernel: [ 641.435946] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=89.16.167.134 LEN=52
kernel: [ 641.436226] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=89.16.167.134 LEN=424
kernel: [ 641.512594] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=89.16.167.134 LEN=52
kernel: [ 641.512762] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=89.16.167.134 LEN=52
kernel: [ 641.512819] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=89.16.167.134 LEN=52
kernel: [ 641.567496] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=140.211.166.176 LEN=60
kernel: [ 641.767707] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=140.211.166.176 LEN=52
kernel: [ 641.768328] matched :IN= OUT=eth0 SRC=192.168.66.106 DST=140.211.166.176 LEN=380 <--
kernel: [ 641.768352] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=140.211.166.176 LEN=380
kernel: [ 641.990287] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=140.211.166.176 LEN=52
kernel: [ 641.990455] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=140.211.166.176 LEN=52
kernel: [ 641.990507] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=140.211.166.176 LEN=52
kernel: [ 641.990559] OUT : IN= OUT=eth0 SRC=192.168.66.106 DST=140.211.166.176 LEN=52
So we have match on only one packet, going to 140.211.166.176. But where is the first match?
The more strange for me is that on ither machine with ubuntu, it gives different counters, like 6 matches for string.
I'v also made same simple test on home notebook with ubuntu 12.04 and i get 2 clear matches.
Mb there is some kind of options to tune module data passing or smth?
UPDATE:
Very simple experiment mb if someone could explain that moment it will give a hint.
Now two pc, server centos 6.3 which listens on port 80 with netcat ans answers string:
# echo "List-IdServer"|nc -l 80
List-Id
And client (debian) connecting to server and sending data with nc and reading answer:
% echo "List-Id"|nc butorabackup 80
List-IdServer
After data exchange i have on SERVER SIDE:
Chain INPUT (policy ACCEPT 33 packets, 2164 bytes)
pkts bytes target prot opt in out source destination
1 60 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 STRING match "List-Id" ALGO name bm TO 6500 LOG flags 0 level 4
5 276 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 LOG flags 0 level 4
Chain FORWARD (policy DROP 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 23 packets, 4434 bytes)
pkts bytes target prot opt in out source destination
0 0 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp spt:80 STRING match "List-Id" ALGO name bm TO 6500 LOG flags 0 level 4
5 282 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp spt:80 LOG flags 0 level 4
and on CLIENT SIDE:
Chain INPUT (policy ACCEPT 28 packets, 2187 bytes)
pkts bytes target prot opt in out source destination
1 66 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp spt:80 STRING match "List-Id" ALGO name bm TO 6500 LOG flags 0 level 4
5 282 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp spt:80 LOG flags 0 level 4
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 23 packets, 1721 bytes)
pkts bytes target prot opt in out source destination
1 60 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 STRING match "List-Id" ALGO name bm TO 6500 LOG flags 0 level 4
5 276 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 LOG flags 0 level 4
So there is NO output rule match for server. Server matched IN, and client matched IN and OUT.
I dont understand how does it work :(
Tnx in advance.

how does the tcp packet “size” reported by tcpdump correlate to the actual IP packet sent?

When I run a tcpdump – what exactly does the number reported between the () after the lowest and highest sequence number mean?
If it correlates to the IP packet sent, how can it be larger than
the mtu of 1500?
On the receive side of this flow tcpdump always
sees packets of 1448 or smaller – where are these packets being
defragmented and reassembled? Can I see that with tcpdump?
Is there any relationship between this number and the size of the tcp socket
buffer?
I.e: Typical tcpdump on the sender:
11:47:18.278352 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 3263771:3272459(8688) ack 1 win 54
11:47:18.278371 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 3272459:3282595(10136) ack 1 win 54
11:47:18.278620 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 3282595:3298523(15928) ack 1 win 54
11:47:18.278727 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 3298523:3301419(2896) ack 1 win 54
11:47:18.278731 IP 1.1.1.36.41034 > 1.1.1.37.60960: P 3301419:3301719(300) ack 1 win 54
11:47:18.777160 IP 1.1.1.36.41034 > 1.1.1.37.60960: P 3301719:3301723(4) ack 1 win 54
11:47:18.777175 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 3301723:3303171(1448) ack 1 win 54
And the receiver:
11:47:18.277895 IP 1.1.1.36.41034 > 1.1.1.37.60960: P 990413:990417(4) ack 1 win 54
11:47:18.277948 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 990417:991865(1448) ack 1 win 54
11:47:18.277953 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 991865:993313(1448) ack 1 win 54
11:47:18.277958 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 993313:994761(1448) ack 1 win 54
11:47:18.278028 IP 1.1.1.36.41034 > 1.1.1.37.60960: . 994761:996209(1448) ack 1 win 54
That is the length of the data, as you can see if you substract the two sequence numbers (after accounting +1 for SYN and FIN flags, if they are present in the packet).
As to why it's greater than the MTU, either your netword card can send jumbo-frames (quite normal on 1Gbps cards), or your card can accept large segments and divide them on hardware (called LSO/GSO or some similar acronym I cannot seem to recall).
Edit: See Large Segment Offload on Wikipedia.

List of possible internal socket statuses from /proc

I would like to know the possible values of st column in /proc/net/tcp. I think the st column equates to STATE column from netstat(8) or ss(8).
I have managed to identify three codes:
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 0100007F:08A0 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 7321 1 ffff81002f449980 3000 0 0 2 -1
1: 00000000:006F 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 6656 1 ffff81003a30c080 3000 0 0 2 -1
2: 00000000:0272 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 6733 1 ffff81003a30c6c0 3000 0 0 2 -1
3: 0100007F:0277 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 7411 1 ffff81002f448d00 3000 0 0 2 -1
4: 0100007F:0019 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 7520 1 ffff81002f4486c0 3000 0 0 2 -1
5: 0100007F:089F 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 7339 1 ffff81002f449340 3000 0 0 2 -1
6: 0100007F:E753 0100007F:0016 01 00000000:00000000 02:000AFA92 00000000 500 0 18198 2 ffff81002f448080 204 40 20 2 -1
7: 0100007F:E752 0100007F:0016 06 00000000:00000000 03:000005EC 00000000 0 0 0 2 ffff81000805dc00
The above shows:
On line sl 0: a listening port on tcp/2208. st = 0A = LISTEN
On line sl 6: An established session on tcp/22. st = 01 = ESTABLISHED
On line sl 7: An socket in TIME_WAIT state after ssh logout. No inode. st = 06 = TIME_WAIT
Can anyone expand on this list? The proc(5) manpage is quite terse on the subject stating:
/proc/net/tcp
Holds a dump of the TCP socket table. Much of the information is not of use apart from debugging. The "sl" value is the kernel hash slot for the socket, the "local address" is the local address and
port number pair. The "remote address" is the remote address and port number pair (if connected). ’St’ is the internal status of the socket. The ’tx_queue’ and ’rx_queue’ are the outgoing and incom-
ing data queue in terms of kernel memory usage. The "tr", "tm->when", and "rexmits" fields hold internal information of the kernel socket state and are only useful for debugging. The "uid" field
holds the effective UID of the creator of the socket.
And on a related note, the above /proc/net/tcp output is showing a few listening processes (2208, 62, 111 etc). However, I cannot see a listening tcp connection on tcp/22, althought the established and time_wait states are shown. Yes, I can see them in /proc/net/tcp6 but should they not be present in /proc/net/tcp also? Netstat output shows it differently to applications bound only to ipv4. E.g.
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 4231/portmap
tcp 0 0 :::22 :::* LISTEN 4556/sshd
Many thanks,
-Andrew
They should match to the enum in ./include/net/tcp_states.h in the linux kernel sources:
enum {
TCP_ESTABLISHED = 1,
TCP_SYN_SENT,
TCP_SYN_RECV,
TCP_FIN_WAIT1,
TCP_FIN_WAIT2,
TCP_TIME_WAIT,
TCP_CLOSE,
TCP_CLOSE_WAIT,
TCP_LAST_ACK,
TCP_LISTEN,
TCP_CLOSING, /* Now a valid state */
TCP_MAX_STATES /* Leave at the end! */
};
As for your 2. question, are you really sure there's not an sshd listening on e.g. 0.0.0.0:22 ? If not, I suspect what you're seeing is related to v4-mapped-on-v6 sockets, see e.g. man 7 ipv6

wireshark and tcpdump -r: strange tcp window sizes

I'm capturing http traffic with tcpdump and am interested in TCP slow start and how window sizes increase:
$ sudo tcpdump -i eth1 -w wget++.tcpdump tcp and port 80
When I view the dump file with Wireshark the progression of window sizes looks normal, i.e. 5840, 5888, 5888, 8576, 11264, etc...
But when I view the dump file via
$ tcpdump -r wget++.tcpdump -tnN | less
I get what seem to be nonsensical windows sizes ( IP addresses omitted for brevity ):
: S 1069713761:1069713761(0) win 5840 <mss 1460,sackOK,timestamp 24220583 0,nop,wscale 7>
: S 1198053215:1198053215(0) ack 1069713762 win 5672 <mss 1430,sackOK,timestamp 2485833728 24220583,nop,wscale 6>
: . ack 1 win 46 <nop,nop,timestamp 24220604 2485833728>
: . 1:1419(1418) ack 1 win 46 <nop,nop,timestamp 24220604 2485833728>
: P 1419:2002(583) ack 1 win 46 <nop,nop,timestamp 24220604 2485833728>
: . ack 1419 win 133 <nop,nop,timestamp 2485833824 24220604>
: . ack 2002 win 178 <nop,nop,timestamp 2485833830 24220604>
Is there a way to get normal / absolute window sizes on the command line?
The window sizes are correct - they're just unscaled.
The connection initiator has set a wscale (window scaling factor) of 7, so its subsequent win values must be multiplied by 128 to get the window size in bytes. Thus the win 46 indicates a window of 5888 bytes.
The connection recipient has set a wscale of 6, so its win values must be multiplied by 64. Thus win 133 indicates a window of 8512 bytes, and win 178 indicates 11392 bytes.
Also, if the tool (wireshark or tcpdump, it doesn't matter) doesn't see the syn, it has to print the unscaled value, which can fool you

Resources