I ran into issues with TCP keepalives in VirtualBox; maybe someone can help me out here.
Our software runs inside a Linux VM hosted with VirtualBox on Windows.
From there a TCP connection is established, and we try to detect broken connections by enabling TCP keepalives.
Since we want to detect broken connections quickly, we changed the keepalive defaults at the OS level (since that's where you're supposed to do this), and that's where we keep running into issues with VirtualBox.
I did a little investigating with a simple Python script and Wireshark and got some strange results.
The Python script looks like this:
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)) # seconds
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL)) # seconds
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT))
s.connect(("10.33.42.0", 2112))
s.recv(16)
Pretty basic. I establish a TCP connection and block in the recv call. The other side never sends anything (I just use netcat), so keepalives should kick in after the TCP_KEEPIDLE interval.
I also print the keepalive settings to check that they were picked up correctly.
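Incidentally, the keepalive timing can also be set per socket instead of relying on the OS-wide defaults, which sidesteps the question of which OS's settings win. A minimal sketch (the Windows `SIO_KEEPALIVE_VALS` branch is an assumption here and only runs where Python exposes that ioctl):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

if hasattr(socket, "TCP_KEEPIDLE"):
    # Linux: idle seconds before the first probe, seconds between
    # probes, and how many unanswered probes kill the connection.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
elif hasattr(socket, "SIO_KEEPALIVE_VALS"):
    # Windows: (on/off, idle time in ms, interval in ms);
    # the probe count is fixed by the OS.
    s.ioctl(socket.SIO_KEEPALIVE_VALS, (1, 10_000, 5_000))
```

Whether the hypervisor honors these per-socket values is of course exactly what's in question here, but it removes the OS-level configuration as a variable.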
So I ran this script in different environments and got pretty different results.
Running it on plain Linux works: I can set TCP_KEEPIDLE to 10 seconds and Wireshark shows a TCP keepalive being sent every 10 seconds.
Next I ran it inside WSL (Windows Subsystem for Linux), which I use for development, and the first strange thing came up:
The settings seemed to be picked up correctly (the script's output matched the settings I made at OS level, with a TCP_KEEPIDLE of 10 seconds), but when I looked at Wireshark I saw a keepalive sent every minute, no matter what I configured. I can't explain where this one-minute interval comes from (I assumed it could come from the Windows settings, but those were at their defaults, which would mean a keep-idle time of 2 hours).
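For reference, the system-wide defaults that new sockets inherit on the Linux side can be read straight from /proc; a small sketch using the standard sysctl paths (though, as the WSL behavior shows, the values reported inside a guest need not match what actually goes on the wire):

```python
def keepalive_defaults():
    """Read the system-wide keepalive defaults that new sockets inherit."""
    values = {}
    for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
        # Standard Linux sysctl files; in WSL or a VM the numbers shown
        # here may still differ from what the host transmits.
        with open(f"/proc/sys/net/ipv4/{name}") as f:
            values[name] = int(f.read())
    return values
```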
Then I ran it on our actual deployment (the Linux VM running under VirtualBox on Windows). The script's output looked good again and picked up the TCP_KEEPIDLE of 10 seconds, but I never saw a single keepalive message.
My first guess was again that the Windows settings are used even though the script prints the Linux settings. The default idle time there is 2 hours, which would explain why I saw no message during the 10 minutes I tested.
So I changed the settings on the Windows host, but that made no difference for the VM setup.
What I did notice, though, is that when I ran the Python script directly on Windows, it still printed the default settings even though Wireshark showed the keepalives working as configured. So it seems that getsockopt does not always report the settings that are actually in effect.
I'm pretty much at a dead end here, since I don't know which settings actually apply in my two VM setups (WSL and VirtualBox).
Does anyone have more insight into what happens with keepalive settings in VM scenarios, and where the effective settings come from?
Related
On my root server the operating system is Debian 7, with kernel 2.6.32.
My problem is that TCP/IP connections seem to be unstable:
SSH connections often hang or time out, and the web server sometimes responds quickly, while at other times the client (browser) waits and waits for a response.
I don't know where to start with this problem. I have filed a hardware-check request in my ISP's ticket system.
Is there a hint you can give me?
I would guess it is DNS-related. I would isolate one destination to communicate to and one location to communicate from. Using ping, I would determine whether there is latency in the link itself (ping from the server to one destination, and from your client workstation to the troubled server). Once you have determined that the times are predictable (no `*`s), I would look up both IP addresses and put them in /etc/hosts.
When you run ssh, consider using -vv to see what it is doing; maybe that will help.
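The isolation step above can also be scripted: this sketch times DNS resolution and the TCP connect separately, so a slow lookup isn't mistaken for a slow link (the host and port in the commented example are placeholders):

```python
import socket
import time

def measure(host, port, timeout=5.0):
    """Time DNS resolution and the TCP connect separately, so a slow
    name lookup isn't confused with latency in the link itself."""
    t0 = time.monotonic()
    ip = socket.gethostbyname(host)
    dns_ms = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    conn = socket.create_connection((ip, port), timeout=timeout)
    connect_ms = (time.monotonic() - t1) * 1000
    conn.close()
    return dns_ms, connect_ms

# Hypothetical usage against the troubled server:
# dns_ms, connect_ms = measure("the-troubled-server.example", 22)
```

If dns_ms is large but connect_ms is small, the DNS theory holds and the /etc/hosts entries will help; if connect_ms is the slow one, the link itself is the suspect.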
The problem was a broken network adapter in the cluster's mainframe. The provider fixed the issue.
I currently have a tiny, headless (and I certainly want to keep it that way :) ) Linux virtual machine set up with Vagrant and VirtualBox on which, for testing, I run an X11 application (Firefox) whose output goes to Xming on my real machine. All of that is hunky-dory and working perfectly, but I'm not happy yet!
What I want to be able to do is a few setup steps, make sure everything is running correctly, then disconnect from the server and let the testing run its course. If something goes wrong, or I just want to check the current status (some of the tests may run for hours), I'd like to hop back onto the server and point the X11 output at my machine again. But despite a good deal of Googling, and learning loads about X11 that I didn't know a few hours ago, I can't find anything about choosing where the output of an X11 application goes except at startup, i.e.:
DISPLAY=:10 firefox &
I had read in some random blog post that the Xephyr X server can do this (act as a kind of intermediate X11 buffer, which redirects the output if you want it to and otherwise sends it to /dev/null), but I can't find any other reference to that, or to anything else that does it.
There's a program called Xpra that works sort of like screen, but for X sessions. It starts an X session separate from the main one for remote access, and you can connect to and disconnect from it at will from the host machine.
http://www.xpra.org/
I currently have one acceptable way to do this, which serves my purpose: a vnc4server takes Firefox's output, and I can connect to and disconnect from it without any issue, just like with a normal VNC server. This lets me do what I want to do, but not how I want to do it; I'd like to manage without a VNC server at all.
I have installed andLinux Beta 2 on my Windows XP machine. Everything worked fine until last night: without my recalling any change to the network configuration or the andLinux setup, the network stopped working inside andLinux. By that I mean: I open a KDE console and run "ping yahoo.com"; I see the DNS name resolved correctly, but there is no response at all.
andLinux starts as a Windows XP service. In the Windows Task Manager I can see the following services up and running: colinux-daemon.exe, colinux-net-daemon.exe, colinux-slirp-net-daemon.exe.
On the andLinux side there are two network interfaces, eth0 and eth1. eth1 is configured to communicate with the local Windows XP system; I use it with Samba to access Windows directories, no problem. From the Windows XP side I can ssh into the andLinux box via eth1's IP address.
eth0 is configured as slirp, with no port forwarding. eth0 has IP 10.0.2.15, default gateway 10.0.2.2 and netmask 255.255.255.0; these are configured in /etc/network/interfaces. DNS is 10.0.2.3, which, as just mentioned, resolves yahoo.com correctly.
On the Windows side the internet works fine. I disabled the firewall on all network interfaces and rebooted my laptop, with no luck. I searched the net and it seems no one else has this problem; people report the network dying only if they kill colinux-slirp-net-daemon. What frustrates me is that this whole thing worked well and then broke all of a sudden for no apparent reason. If anyone has experience with this issue, please help. Much appreciated!
I thought I had the same problem, but then found that my andLinux system's network connectivity was actually working fine, and that several things made it difficult to tell what was going on.
The test I used to validate connectivity: wget www.yahoo.com
Behavior I observed that made troubleshooting difficult:
Pings from andLinux: not all hosts will respond to pings from the andLinux OS (i.e. the Ubuntu guest, not the host Windows OS). According to my packet captures, the pings leave the host OS's adapter as UDP probes rather than ICMP echo requests, and the major IPs/hosts I usually ping to test connectivity (yahoo, google, 4.2.2.2, etc.) currently don't respond to that kind of ping.
Traceroutes from andLinux: even when successful, these never show more than two hops when run from the andLinux OS. When successful, both hops show 10.0.2.2; when unsuccessful, the second hop just times out. I'm not sure why, though I'm sure there is an explanation.
Packet captures: at the host OS level, the capture (e.g. with Wireshark) must be done on the physical interface the traffic actually traverses. I was initially capturing on the TAP-Win32 adapter, but that only showed X Window traffic.
Installed apt source URLs no longer valid: Ubuntu 9.04 is long out of support by now, so the URLs in the apt sources.list no longer exist. This is what threw me off in the first place: I didn't troubleshoot it specifically, just tried to test my internet connectivity first, and then got confused by the ping and traceroute behavior described above. I changed http://us.archive.ubuntu.com/ubuntu to http://old-releases.ubuntu.com/ubuntu/ in sources.list and was good to go.
Good day.
I need an application or script for a Linux firewall that will monitor idle TCP connections to one site and close them after some number of minutes. So far I have found cutter and Killcx, but these are command-line tools with no idle detection.
Do you know of an app, or perhaps a firewall implementation for Linux, with everything I want in one package?
Thanks.
Perhaps you can hack something together with the conntrack iptables module, especially the --ctexpire match. See man iptables.
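If nothing kernel-level pans out, the idle-detection part that cutter and Killcx lack is straightforward in user space. A minimal sketch (not a firewall, just a per-connection watchdog that closes the socket after a period of silence, using a receive timeout):

```python
import socket

def serve_once(listener, idle_limit=120.0):
    """Accept one connection and close it once the peer has been
    silent (no bytes received) for idle_limit seconds."""
    conn, _addr = listener.accept()
    conn.settimeout(idle_limit)
    try:
        while True:
            data = conn.recv(4096)
            if not data:            # peer closed the connection itself
                return "closed"
            # traffic arrived, so the idle timer starts over on the
            # next recv call
    except socket.timeout:          # idle_limit seconds of silence
        return "idle-closed"
    finally:
        conn.close()
```

This only works for connections that terminate at (or are proxied through) the machine running it, whereas a firewall-level solution like conntrack would cover forwarded traffic too.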
Were there any important changes in how SLES 10 implements TCP sockets versus SLES 9?
I have several apps written in C# (.NET 3.5) that run on Windows XP and Windows Server 2003. They have been running fine for over a year, getting market data from a SLES 9 machine over a socket connection.
The machine was upgraded to SLES 10 today, and it's causing some strange behavior. The socket normally returns a few hundred or a few thousand bytes every second, but occasionally I stop receiving data: ten or more seconds go by with nothing, and then Receive returns with 10k+ bytes at once. And some buffering is causing data loss, because the bytes I receive on the socket no longer form a correct packet.
The only thing that changed was the SLES 9 to 10 upgrade, and rolling back fixes this immediately. Any ideas?
The dropped packets can be resolved by upgrading the kernel to 2.6.16.60-0.37 or later. The bnx2 kernel module is the root cause of the dropped packets; this is a known issue with SLES 10 out of the box.
Reference: http://www.novell.com/support/search.do?cmd=displayKC&sliceId=SAL_Public&externalId=7002506
The defaults for the /proc/sys/net settings may have changed; maybe newer SLES enables things like tcp_ecn.
If your network is dropping some packets it doesn't like under SLES 10, then SLES 10 is probably enabling newer TCP features. Otherwise I don't know. I'd look at the traffic with tcpdump/Wireshark, and maybe strace the server process to see what system calls it is making.
SLES is the sender, so it's possible something changed that makes it wait until it has a full window of data, but 10k is too much for that. It sounds more like dropped packets: when a missing packet finally arrives, the queued-up data behind it can all be delivered at once, producing one large Receive.
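On the "bytes no longer make a correct packet" point: TCP is a byte stream, so a single Receive returning 10k+ bytes after a stall is perfectly legal, and the receiver must reassemble its own message boundaries rather than trust recv boundaries. A sketch of length-prefixed framing (the 4-byte header is an assumption; the real feed's framing may differ):

```python
import socket
import struct

def recv_exactly(sock, n):
    """Read exactly n bytes, tolerating arbitrary segmentation: recv
    may return any number of bytes, including a large burst of
    queued-up data after a retransmission."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return bytes(buf)

def send_message(sock, payload):
    # 4-byte big-endian length prefix, then the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_message(sock):
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)
```

With reassembly like this in place, a 10k burst after a stall parses into exactly the same messages as a steady trickle, so the kernel's delivery pattern stops mattering.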