Are there any important changes in how SLES 10 implements TCP sockets vs. SLES 9?
I have several apps written in C# (.NET 3.5) that run on Windows XP and Windows Server 2003. They've been running fine for over a year, getting market data from a SLES 9 machine using a socket connection.
The machine was upgraded today to SLES 10 and it's causing some strange behavior. The socket normally returns a few hundred or a few thousand bytes every second. But occasionally I stop receiving data: ten or more seconds go by with nothing, and then Receive returns 10k+ bytes at once. Somewhere along the way data is also being lost, because the bytes I receive on the socket no longer assemble into a correct packet.
The only thing that changed was the SLES 9 to 10 upgrade, and rolling back fixes this immediately. Any ideas?
The dropped packets can be resolved by upgrading the SMP kernel to 2.6.16.60-0.37 or later. The bnx2 kernel module (the Broadcom NetXtreme II network driver) is the root cause of the dropped packets. This is a known issue with SLES 10 out of the box.
Reference: http://www.novell.com/support/search.do?cmd=displayKC&sliceId=SAL_Public&externalId=7002506
The defaults for /proc/sys/net settings may have changed. Maybe newer SLES enables things like tcp_ecn?
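A quick way to check is to dump the relevant /proc/sys/net values on both the SLES 9 and the SLES 10 box and diff the output. A minimal sketch (the list of keys to compare is just a guess at the likely suspects):
# sketch: print a few TCP-related sysctl defaults so two hosts can be diffed
import os

KEYS = [
    "ipv4/tcp_ecn",
    "ipv4/tcp_window_scaling",
    "ipv4/tcp_sack",
    "ipv4/tcp_timestamps",
    "ipv4/tcp_rmem",
    "ipv4/tcp_wmem",
    "core/rmem_default",
    "core/wmem_default",
]

for key in KEYS:
    path = os.path.join("/proc/sys/net", key)
    value = open(path).read().strip() if os.path.exists(path) else "<missing>"
    print(key + " = " + value)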
If your network is dropping packets it doesn't like once SLES 10 is the sender, then SLES 10 is probably enabling newer TCP features that something on the path mishandles. Otherwise I don't know. I'd look at it with tcpdump/Wireshark, and maybe strace the server process to see what system calls it is making.
SLES is the sender, so it's possible something could have changed that made it decide to wait until it had a full window of data or something. But 10k is too much. Sounds more like dropped packets, and then a large return when a missing packet finally arrives, allowing the queued up data to be returned too.
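If you want to see the burst pattern from the receiving side before reaching for tcpdump, a tiny receiver that logs read sizes and inter-arrival gaps makes it obvious. A rough sketch (the host, port and buffer size are placeholders, not your actual feed):
# sketch: log how much data each recv() returns and how long we waited for it
import socket
import time

HOST, PORT = "sles-box.example.com", 9000   # placeholders for the market data feed

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((HOST, PORT))

last = time.time()
while True:
    data = s.recv(65536)
    if not data:
        break
    now = time.time()
    # a multi-second gap followed by a huge read points at a retransmitted
    # segment finally arriving and the queued data being delivered at once
    print("recv %6d bytes after %6.2f s" % (len(data), now - last))
    last = now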
I ran into issues with TCP keepalives in VirtualBox; maybe someone can help me out here.
Our software runs inside a Linux VM running under VirtualBox on Windows.
From there a TCP connection is established, and we try to detect broken connections by enabling TCP keepalives.
Since we want to detect broken connections quickly, we changed the keepalive defaults at the OS level (since that's where you're supposed to do this), and that's where we keep running into issues with VirtualBox.
I did a little investigation with a simple python script and wireshark and got some strange results.
The python script looks like this:
import socket

# enable keepalive on the socket; idle/interval/count are deliberately not
# set here, so the values printed below are the OS-level defaults
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))   # seconds
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL))  # seconds
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT))    # probe count
s.connect(("10.33.42.0", 2112))
s.recv(16)  # block here; the peer never sends, so keepalives should kick in
Pretty basic. I establish a TCP connection and block in the recv call. The other side never sends anything (I just use netcat), so the keepalive should kick in after the TCP_KEEPIDLE interval.
I also print out the keepalive settings to see if they were picked up correctly.
So now I ran this script in different environments and got pretty different results with it.
Running it on plain Linux works: I can set TCP_KEEPIDLE to 10 seconds and Wireshark shows that a TCP keepalive is sent every 10 seconds.
Next I ran it inside WSL (Windows Subsystem for Linux), which I use for development purposes, and the first strange thing came up:
The settings seemed to be picked up correctly, since the script's output matched the settings I had made at OS level (a TCP_KEEPIDLE of 10 seconds), but in Wireshark I saw that a keepalive was sent every minute, no matter what I configured. I can't explain where this one-minute interval comes from (I assumed it could come from the Windows settings, but those were at their defaults, which would mean a keepalive idle time of 2 hours).
Then I ran it on our actual deployment (a Linux VM under VirtualBox on Windows). The output of the script looked good again and picked up the 10 seconds for TCP_KEEPIDLE, but I never saw a single keepalive message.
My first guess was again that maybe the Windows settings are being used even though the script prints the Linux settings. The interval there is 2 hours by default, which would explain why I didn't see any message in the 10 minutes I tested.
So I changed the settings on the Windows host, but that did not make a difference for the VM setup.
What I did notice is that when running the Python script directly on Windows, it still printed the default settings, even though Wireshark showed the keepalives working as configured. So it seems that Python does not always report the effective settings when reading them with getsockopt.
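For what it's worth, the per-socket variant of these settings would look roughly like this (just a sketch; whether the VM layers honour these any better than the OS-level defaults is exactly what I can't tell):
import socket

def enable_keepalive(sock, idle=10, interval=5, count=3):
    # configure keepalive explicitly on this socket instead of relying on
    # OS-level defaults (which may come from the host, the guest or WSL)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):            # Linux
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    elif hasattr(socket, "SIO_KEEPALIVE_VALS"):    # Windows (no per-socket count)
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle * 1000, interval * 1000))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
s.connect(("10.33.42.0", 2112))
s.recv(16)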
I'm pretty much at a dead end here, since I don't know what settings actually apply in my two VM setups (WSL and VirtualBox).
Does anyone have more insight into what happens with keepalive settings in VM scenarios and where the settings come from?
I've noticed within the past month or maybe two that whenever I resume my laptop from sleep, the sshd.exe Cygwin process consumes 100% of a CPU core (i.e. it reports a constant "25%" utilization on my 4-core machine).
I think but cannot confirm that this started with one of the major feature updates to Windows 10 that I installed recently.
The problem can be temporarily resolved by sending a HUP to the sshd process, but as soon as I suspend/resume it starts spinning again, which substantially increases power draw and rips through the battery on the machine (not to mention it gets hot as hell).
Any suggestions on what to do here? I can't tell at this point whether Windows is the victim and Cygwin/sshd is the culprit, or the other way around...
Versions:
Windows 10 Pro 20H2 build 19042.1052
OpenSSH_8.5p1, OpenSSL 1.1.1f 31 Mar 2020
In the past few days I have been doing extensive testing of Subversion with different clients, operating systems, and client and server versions, and have noticed very strange behaviour with Windows clients connecting to Linux servers: they hit the server with excessive CPU usage on the sshd process, whereas Linux clients do not exhibit this behaviour.
A sample test setup is as follows:
Server: Linux Ubuntu 16.04.3 LTS, OpenSSH_7.2p2 Ubuntu-4ubuntu2.2, OpenSSL 1.0.2g 1 Mar 2016, Subversion 1.9.3 (and 1.9.7)
Client: TortoiseSVN 1.9.7
When checking out large repositories, the Linux server's sshd process runs at 100% CPU usage. This effectively slows down performance and ultimately the speed at which the checkout runs. Linux clients connecting to the same server do not cause this load on the server.
This happens even when compression is turned off and when the encryption ciphers are changed, as well as with different versions of Subversion; the behaviour is identical. I'm not sure whom to address this issue to, as it happens not only with TortoiseSVN but with SlikSVN as well. Any direction would be appreciated.
If you're just looking for a way for a controlled set of users to access your SVN servers, an easy workaround for any Windows 10 users is to have them use SVN from WSL (Windows Subsystem for Linux). In fact, I would consider testing that route to isolate the client from the network stack, etc.
It's also worth noting that the default SVN settings may convert line endings, so every line of every file may be getting converted to the Windows default line endings during checkout.
There are likely better answers out there, but those are my initial thoughts.
I use REDHAWK v2.1.0 to implement the AM demodulation part with three components.
Platform --> Xilinx Zynq 7035 (ARM Cortex-A9 x2)
Operating system (OS) --> embedded Linux.
When I connect the REDHAWK IDE on an external PC over Ethernet and display the waveform between the components, an abnormal sound occurs.
Also, when I disconnect the LAN cable, the AM demodulation processing of REDHAWK inside the ARM stops.
REDHAWK inside the ARM appears to be waiting for requests from the REDHAWK IDE on the external PC.
From this, it seems that the abnormal noise occurs when requests from the REDHAWK IDE on the external PC are delayed.
How can I keep REDHAWK's AM demodulation processing inside the ARM running without interruption while the REDHAWK IDE on the external PC is connected and monitoring the waveform?
The environment is below.
CPU: Xilinx Zynq ARM Cortex-A9, 2 cores, 600 MHz
OS: embedded Linux, kernel 3.14 with real-time patch
Frame length: 5.333 ms (48 kHz sampling, 256 samples)
I have seen similar, if not identical, issues when running on an ARM board. Tracking down the exact issue may be difficult; in my experience it hasn't been REDHAWK-specific and has really been an issue with omniORB or its configuration. I believe one of the fixes for me was recompiling omniORB rather than using the omniORB package provided by my OS (which didn't make any sense to me at the time, as I used the same flags and build process as the package maintainer).
First, I would confirm the issue is specific to ARM: if it's easy enough, set up the same components, waveforms, etc. on a second x86_64 host and validate that the problem does not occur.
Second, I would try a "quick fix" of setting the omniORB timeouts on the ARM host by editing the /etc/omniORB.cfg file and setting:
clientCallTimeOutPeriod = 2000
clientConnectTimeOutPeriod = 2000
This will set a 2-second timeout on CORBA interactions, for both the connect portion and the call-completion portion. In the past this has served as a quick fix for me but does not address the underlying issue. If this "fixes" it for you, then you've at least narrowed part of the issue down, and you can enable omniORB debugging using the traceLevel configuration option to find out which call is timing out. See this sample configuration file for all options.
If you want to dive into the underlying issues you'd need to see what the IDE and framework are doing when things lock up. With the IDE this is easy; simply find the PID of the java process and run kill -3 <pid> and a full stack trace will be printed in the terminal that is running the IDE. This can give you hints as to what calls are locked up. For the framework you'll need to use GDB and connect to the process in question and tell GDB to print the stack trace. You'd have to do some investigation ahead of time to determine which process is locking up.
If it ends up being an issue with the Java CORBA implementation on x86_64 talking with the C++ CORBA implementation on ARM, you could also try launching / configuring / interacting with the ARM board via the REDHAWK Python API from your x86_64 host. This may have better compatibility, since both sides would then use the same omniORB CORBA implementation.
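For example, something along these lines from the x86_64 host (the domain name and waveform path below are placeholders, and this is only a sketch of the Python API rather than a drop-in replacement for the IDE's plotting):
# sketch: interact with the ARM board's domain from the x86_64 host using
# the REDHAWK Python API; "REDHAWK_DEV" and the waveform path are placeholders
from ossie.utils import redhawk

dom = redhawk.attach("REDHAWK_DEV")   # attach to the domain the ARM board registered with
for app in dom.apps:                  # list waveforms that are already running
    print(app.name)

app = dom.createApplication("/waveforms/AMDemod/AMDemod.sad.xml")
app.start()                           # the waveform runs in the domain on the ARM board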
This question is a bit tricky and I don't really think I'll find an answer but I'll try anyway.
I'm writing a C++ program using gloox for XMPP transfers. My problem, which happens only on my computer (Linux Mint 13 MATE 32-bit), is that the onConnect() handler is never called. This was not a problem when we were using the jabber.org server (actually, I didn't even know it was not being called...), but problems started occurring when we installed a LAN ejabberd server.
With the jabber.org server, even though onConnect() was not called, the application was able to send and receive messages fine. Not so with ejabberd. At first I thought it was a problem with the certificate or something, but then we tried our other Linux boxes (Ubuntu 12.04 x64, Arch x64 and Debian 6.0 32-bit, which is the machine running the server) and it works fine on all of them. Plus, the sister application using Python/Twisted can connect fine from the problematic computer.
The validation function onTLSConnect() is called every time and it returns true. On the problematic computer, when using our ejabberd server, the connection isn't established after that and the socket closes itself after about 25 seconds (and onDisconnect() is called...).
So, my question: could there be a network setting (a firewall, maybe?) that is preventing gloox from completing the connection? Or has anyone experienced a similar issue?
Thanks!
EDIT: I made a VM of Mint 13 MATE 32-bit on my laptop and the same problem arises. I can now conclude it's a bug somewhere in Mint.
EDIT 2: It works fine on Mint 64-bit... I opened a ticket on Mint's bug page.
I ran into this problem last week; it seems to be a bug in gloox.
It happened on 32-bit Linux.
See this: https://bugs.launchpad.net/linuxmint/+bug/1071416
In fact, you are "online" on the server, but your "presence" state is unknown. You can simply send a "Chat" presence state to the server to continue your work.
Like this:
#ifdef GLOOX_ON_CONNECT_BUG_PATCH
Poco::Thread::sleep(3000);  // sleep 3 seconds, then the connection succeeds... ugh, damned bug
this->is_connected = true;
client->setPresence(Presence::Chat, 0);  // send a "Chat" presence so the server sees us as available
#endif