net.DialTCP produces "connection refused" error on Linux but not on Windows - linux

Code
To reproduce requires two application running and connecting to each other through TCP. So I've made a tiny repo that also includes the powershell build script. link to the full repo
However to avoid the extra click, here is the code for clientA.go.
package main
import (
"fmt"
"net"
"time"
)
func main() {
clientA, err := net.ResolveTCPAddr("tcp4", fmt.Sprintf(":%v", "2222"))
if err != nil {
fmt.Println(err)
return
}
clientB, err := net.ResolveTCPAddr("tcp4", fmt.Sprintf(":%v", "3333"))
if err != nil {
fmt.Println(err)
return
}
for {
clientAtoB, err := net.DialTCP("tcp4", clientA, clientB)
if err != nil {
fmt.Println(err)
} else {
defer clientAtoB.Close()
clientAtoB.SetLinger(0)
clientAtoB.SetNoDelay(true)
clientAtoB.SetKeepAlive(false)
fmt.Println("connected as Client A!")
buffer := make([]byte, 64)
_, err = clientAtoB.Read(buffer)
if err != nil {
continue
}
}
time.Sleep(time.Second)
}
}
The code for clientB.go is identical except the local and remote endpoints are swapped around:
clientBtoA, err := net.DialTCP("tcp4", clientB, clientA)
Problem
I build the same go code for both Windows and Linux but at runtime the applications produce different results. Specifically with how TCP connections are dialed on each platform.
On Windows, when I run the two executables clientA.exe and clientB.exe (built from the build.ps1 script) I get the desired result. As seen in this screenshot:
However when I upload and execute the Linux binaries, the result is different:
ubuntu#ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ sudo chmod +x clientA clientB
ubuntu#ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ ls -la
total 10984
drwxrwxr-x 3 ubuntu ubuntu 4096 Apr 27 03:09 .
drwxrwxr-x 4 ubuntu ubuntu 4096 Apr 27 03:08 ..
drwxrwxr-x 8 ubuntu ubuntu 4096 Apr 27 03:08 .git
-rw-rw-r-- 1 ubuntu ubuntu 11255 Apr 27 03:12 A.txt
-rw-rw-r-- 1 ubuntu ubuntu 11255 Apr 27 03:12 B.txt
-rw-rw-r-- 1 ubuntu ubuntu 247 Apr 27 03:08 build.ps1
-rwxrwxr-x 1 ubuntu ubuntu 2950662 Apr 27 03:08 clientA
-rw-rw-r-- 1 ubuntu ubuntu 2642944 Apr 27 03:08 clientA.exe
-rw-rw-r-- 1 ubuntu ubuntu 718 Apr 27 03:08 clientA.go
-rwxrwxr-x 1 ubuntu ubuntu 2950662 Apr 27 03:08 clientB
-rw-rw-r-- 1 ubuntu ubuntu 2642944 Apr 27 03:08 clientB.exe
-rw-rw-r-- 1 ubuntu ubuntu 718 Apr 27 03:08 clientB.go
ubuntu#ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ ./clientA > A.txt & ./clientB > B.txt &
[1] 24914
[2] 24915
ubuntu#ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ cat A.txt
dial tcp4 :2222->:3333: connect: connection refused
ubuntu#ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ cat B.txt
dial tcp4 :3333->:2222: connect: connection refused
ubuntu#ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$
I don't expect the connection refused error since these two applications are running under the same environment, so no firewalls are in effect, and the permissions are identical.
How can I get the same result regardless of platform? Or why are they different in the first place?
Edit
The successful connection on Windows is not just the luck of good timing. On Windows, I can run A for 5 minutes, then when I run B, both connect successfully.
Update (2020-04-27)
After receiving feedback from Go developers, I've been told that this is likely a Linux configuration issue and not specific to Go. Other than permissions, I can't thing of anything that would prevent two applications in the same environment from establishing a TCP connection like this? (These low level Linux stuff isn't really my forte.)

Why this doesn't work on Linux is quite obvious. Both A and B are clients that are connecting to counterpart that needs to listen. On Linux (or UNIX) if you try to run ClientA it will try to dial in to ClientB's address and port. If there's no process already listening on this address and port to accept the connection in that moment ClientA will immediately end up with connection refused error (this is not entirely true, but most of time is, see my EDIT at the end of answer).
On Windows, under the hood Golang uses (for tcp, tcp4 and tcp6 protocols) ConnectEx API which is for connection-oriented sockets. This API behaves different from Linux connect API. If ConnectEx cannot connect immediately it returns error code ERROR_IO_PENDING and behind the scenes OS waits/retries until connection is accepted and established (or it gives up and makes it definitively failed) and then notifies back - this is called overlapped I/O.
Relevant part of MSDN ConnectEx documentation:
Connection-oriented sockets are often unable to complete their connection immediately, and therefore the operation is initiated and the function immediately returns with the ERROR_IO_PENDING or WSA_IO_PENDING error. When the connect operation completes and success or failure is achieved, status is reported using the completion notification mechanism indicated in lpOverlapped.
Now, what happens in your case on Windows is that you try to ConnectEx from both sides and OS connects those sockets for you. This will only work if other side gets connected within certain period. If you try to reasonably increase time.Sleep interval in both clients (e.g. 17 and 28), you can see even on Windows they will have hard time to connect anymore.
Answer to your question is that your code as it is written now depends on OS-specific behavior of TCP dialing in Golang on Windows and is not portable. To fix your software to be portable on any platform supported by Golang you probably want to change logic so both ClientA and ClientB listen for incoming connection and also periodically try to connect to the opposite side.
EDIT: I'm not saying your code can not work on Linux at all. It actually uses rare connection mode called TCP simultaneous connect where you can connect two processes without having any of them listen. Both dialing sides send their SYN simultaneously, so each side responds with SYN/ACK and then ACK to complete the 3-way handshake and ESTABLISH connection. That requires very precise timing and syncing of the connect call in both clients. Both sides would connect if TCP simultaneous connect is allowed in Linux kernel and that sync between connects is achieved (hardly done by just running both clients by hand or from same script; even simulating within same process and thread is not that easy).

Related

Denyhosts on Centos7 option DENY_THRESHOLD_INVALID does not work

using centos7 and denyhosts 2.9 i noticed some strange behavior.
My config is set to:
DENY_THRESHOLD_INVALID = 3
DENY_THRESHOLD_VALID = 10
Which, in my understanding is like: after 3 failed login attempts of NON-EXISTING users from hosts X, deny that host.
After 10 failed logins attempts from EXISTING users from hosts X, deny that host.
While the latter works just fine, the DENY_THRESHOLD_INVALID = 3 setting does not work.
What i noticed is that the /var/log/secure, that danyhosts parses, does handly logns from non-existing accounts and logins from account that exist but are using the wrong pasword, are handled differently.
Aug 10 12:32:42 ftp sshd[27176]: Invalid user adminx from xxx.128.30.135 port 42800
Aug 10 12:32:42 ftp sshd[27176]: input_userauth_request: invalid user adminx [preauth]
Aug 10 12:32:42 ftp sshd[27176]: Connection closed by xxx.128.30.135 port 42800 [preauth]
vs.
Aug 10 12:33:46 ftp sshd[27238]: Failed password for exchange from xxx.128.30.135 port 42802 ssh2
Does anyone know of denyhosts has problems parsing the /var/log/secure file on centos with non-existing accounts vs. existing accounts that use wrong passwords?
Denyhosts debug log also does not say anything. It seems to ignore the login attempt from non-existend users.
any help would be appreciated. Thanks.

Which process sends SIGKILL and terminates all SSH connections on/to my Namecheap Server?

I've been trying to troubleshoot this problem for some days now.
A couple of minutes after starting an SSH connection to my Namecheap server (on Mac/windows/cPanel's "Terminal"), it crashes and give the following error message :
Error: The connection to the server ended in failure at {TIME} PM. (SIGKILL)
and :
Exit Code: 137
I've tried to create some kind of log file for any SIGKILL signal, but, it seems like none can be made on a Namecheap server :
auditctl doesn't exist,
We can't get systemtap because no package managers are available.
Precision :
uname -a : Linux [-n] 2.6.32-954.3.5.lve1.4.78.el6.x86_64 #1 SMP Thu Mar 26 08:20:27 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
I calculated the time between each crash : around 6min.
I don't have a very good knowledge of Linux servers, and maybe didn't include needed information. So please ask for any specificities!

Missing "kernel: Firewall" messages

Where are my iptables logging Blocked messages? I wonder if this is an OpenVZ issue or something from the scripted install. Note, I'm highly technical, but not a server admin. Could the OpenVZ host be blocking and logging outside of my VSP?
I have two newly installed machines running running text-mode CentOS 7 x64, yum up to date packages, and with iptables/CSF.
Also, I ensured machine #2 has all the packages that are on machine #1, though #2 has some extras.
OpenVZ VPS (installed with their image of CentOS 7 x64)
VMware VM (installed with official CentOS 7 x64 minimal mode)
I performed my extra installs/configs exactly the same on both machines, and I have these lines in /etc/csf/csf.conf
TESTING = "0"
TCP_IN = "22,80,443"
UDP_IN = ""
On the VM, I'm getting these /var/log/messages when I nmap scan it:
Apr 12 17:25:23 mach kernel: Firewall: *UDP_IN Blocked* IN=ens192 OUT= ...
Apr 12 17:25:55 mach kernel: Firewall: *TCP_IN Blocked* IN=ens192 OUT= ...
On the VPS, I'm NOT getting any Firewall /var/log/messages when I nmap scan it... but I think it is properly blocking traffic.
How do I even proceed/diagnose this?

TCP establishment delays for 3 seconds

Most of time, my rsh cycle is general OK, we could get following logs from rshd:
Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)
Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root#172.17.0.40 as root: cmd='echo 481'
While for some error case, the rsh could success but there are several seconds delay, see the following timestamp:
Aug 19 04:12:24 shmm500 authpriv.info in.rshd[17968]: connect from 172.17.0.40 (172.17.0.40)
Aug 19 04:12:27 shmm500 auth.info rshd[17972]: root#172.17.0.40 as root: cmd='echo 18'
I also found that, for most normal case, the PID increased by 1, while for most error case, PID increasd by 4, see the PID in above logs, seems rshd forks some processes. So would you provide any explanation for why rshd took these several seconds and PID increase.
Our rsh is the old rsh, not ssh, I'm not sure, but seems the rsh is from netkit. And this is an embedded board with busybox, no strace/pstack.
For client side, I just 'rsh 172.17.0.8 pwd', not hostname is used.
Answer the question by myself:
This issue was caused by a frame loss. Either SYN or SYN+ACK in 3-way handshake was dropped at a rare rate for some reason, anyway the client peer didn't get the SYN+ACK within in 3 seconds timeout(this timeout is hardcoded in Linux kernel), then the connect() resent SYN again, and usually successful at the second try.
From the viewpoint of application, we got 3 seconds delay, or even 6 seconds if it failed at the second try.
Other relevant information:
The first log is from tcpd(aka tcp wrapper)
Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)
The second log is from rshd in netkit 0.17
Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root#172.17.0.40 as root: cmd='echo 481'
rsh need two tcp connections, the first is from rsh client to rshd, and the second tcp connection is from rshd to rsh client, which means the rshd is the tcp client. And my issue is frame loss on the second tcp connection.

CoffeeScript ZMQ Pub not sending Messages

The same pub-sub code works on local machine (Linux zephyr 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:08:16 UTC 2014 i686 i686 i686 GNU/Linux).
However, on EC2 machine (Linux <host> 3.2.0-60-virtual #91-Ubuntu SMP Wed Feb 19 04:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux) it fails.
The security group is set to allow all for 19019 port and also, for all TCP ports starting from 0.
I tried adding prints in the NodeJS ZMQ module and was able to get the data that I am sending when I added it in flush function.
What else could be the problem?
I tried listening to pub traffic using tcpflow on port 19019 but it didn't work. How can I listen to this traffic?
sudo tcpflow -i eth0 port 19019 and sudo tcpflow -i lo port 19019
Both didn't work. Is there any tool through which I can debug this?
Pub.coffee
zmq = require 'zmq'
dpush_socket = zmq.socket 'pub'
dpush_socket.bind 'tcp://127.0.0.1:19019', (err) ->
if not err?
console.log "Bind successful"
dpush_socket.send 'pid' + ' req ' + req.query.pid
Sub.coffee
zmq = require "zmq"
endPoint = "tcp://0.0.0.0:19019"
sub = zmq.socket "sub"
sub.identity = 'worker' + process.pid;
sub.connect endPoint
console.log "worker connected!"
sub.subscribe('')
sub.on "message", (msg) ->
console.log(sub.identity + 'got ' + msg.toString())
Transport Class shall rather meet each other on the same IP:PORT#
Sub.coffee
zmq = require "zmq"
# # rather set URL, where PUB .bind() listens
endPoint = "tcp://127.0.0.1:19019" # endPoint = "tcp://0.0.0.0:19019"
Part of the answer is probably what user3666197 pointed out: you need to bind and connect on the same IP. I'm not sure what you intend with the 0.0.0.0 address, and it shouldn't work even on your local machine unless you found some undocumented corner of your network stack that supports this behavior.
The other thing is that you either want to include your send call in your callback, or probably want to use bindSync to ensure that the socket is bound before you attempt to send anything. What may be happening is that the socket is discarding your sent message because the socket hasn't completed binding by the time you get to the call. This could well be different between different machines.
The problem is I use a nodejs cluster module and in each of the work a zmq pub socket is created which binds on same port which messes up the issue. On my local machine its a single worker spawning.

Resources