Most of the time my rsh cycle works fine, and we get the following logs from rshd:
Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)
Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root@172.17.0.40 as root: cmd='echo 481'
In some failing cases, however, the rsh still succeeds but with a delay of several seconds; compare the timestamps below:
Aug 19 04:12:24 shmm500 authpriv.info in.rshd[17968]: connect from 172.17.0.40 (172.17.0.40)
Aug 19 04:12:27 shmm500 auth.info rshd[17972]: root@172.17.0.40 as root: cmd='echo 18'
I also found that in most normal cases the PID increases by 1, while in most error cases it increases by 4 (see the PIDs in the logs above), so it seems rshd forks some processes. Can anyone explain why rshd takes these several seconds, and why the PID jumps?
Our rsh is the old rsh, not ssh; I'm not certain, but it seems to be the one from netkit. This is an embedded board running BusyBox, so no strace/pstack is available.
On the client side I just run 'rsh 172.17.0.8 pwd'; no hostname is used.
Answering my own question:
This issue was caused by frame loss. Either the SYN or the SYN+ACK of the 3-way handshake was dropped, at a low rate, for some reason; in any case the connecting peer did not receive the SYN+ACK within the 3-second timeout (this timeout is hardcoded in the Linux kernel), so connect() retransmitted the SYN, which usually succeeded on the second try.
From the application's point of view that shows up as a 3-second delay, or even 6 seconds if the second try failed too.
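If you want to confirm this from the application side without strace, simply timing connect() is enough. Here is a minimal sketch of the idea in Go (only an illustration, not something I ran on the BusyBox board; the target address is a placeholder, and the ~3-second threshold assumes the old kernel's initial SYN retransmission timeout):

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Placeholder target; in my setup this would be the rshd side (port 514).
    const target = "172.17.0.8:514"

    for i := 0; i < 10; i++ {
        start := time.Now()
        conn, err := net.DialTimeout("tcp", target, 10*time.Second)
        elapsed := time.Since(start)
        if err != nil {
            fmt.Printf("connect failed after %v: %v\n", elapsed, err)
            continue
        }
        conn.Close()
        // A handshake on a LAN normally completes in milliseconds; a jump to
        // roughly 3 seconds suggests the SYN or SYN+ACK was lost and retransmitted.
        if elapsed > 2*time.Second {
            fmt.Printf("slow handshake: %v (likely SYN retransmission)\n", elapsed)
        } else {
            fmt.Printf("handshake OK in %v\n", elapsed)
        }
        time.Sleep(time.Second)
    }
}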
Other relevant information:
The first log line is from tcpd (a.k.a. TCP Wrappers):
Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)
The second log line is from rshd in netkit 0.17:
Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root@172.17.0.40 as root: cmd='echo 481'
rsh needs two TCP connections: the first goes from the rsh client to rshd, and the second goes from rshd back to the rsh client, which means rshd is the TCP client for that one. My issue was frame loss on this second connection.
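To make that two-connection shape concrete, here is a toy loopback sketch (illustrative only, not netkit's actual code or protocol): the client listens on an extra port, announces that port over the first connection, and the server dials back, so on the second connection the server side is the TCP client and a lost SYN there delays the server.

package main

import (
    "bufio"
    "fmt"
    "net"
    "strings"
)

func main() {
    // Primary connection: stands in for rsh client -> rshd (port 514).
    primary, err := net.Listen("tcp4", "127.0.0.1:0")
    if err != nil {
        panic(err)
    }

    // "Server" side: accept, read the announced port, then dial back
    // (this back-connection stands in for rshd's stderr channel).
    go func() {
        conn, err := primary.Accept()
        if err != nil {
            return
        }
        defer conn.Close()
        portLine, _ := bufio.NewReader(conn).ReadString('\n')
        back, err := net.Dial("tcp4", "127.0.0.1:"+strings.TrimSpace(portLine))
        if err != nil {
            fmt.Println("back-connection failed:", err)
            return
        }
        defer back.Close()
        fmt.Fprintln(back, "second connection up; server was the TCP client")
    }()

    // "Client" side: open the listener for the back-connection first,
    // then connect to the server and announce the port number.
    backLn, err := net.Listen("tcp4", "127.0.0.1:0")
    if err != nil {
        panic(err)
    }
    conn, err := net.Dial("tcp4", primary.Addr().String())
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    fmt.Fprintf(conn, "%d\n", backLn.Addr().(*net.TCPAddr).Port)

    back, err := backLn.Accept()
    if err != nil {
        panic(err)
    }
    defer back.Close()
    msg, _ := bufio.NewReader(back).ReadString('\n')
    fmt.Print(msg)
}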
Related
I've been trying to troubleshoot this problem for some days now.
A couple of minutes after starting an SSH connection to my Namecheap server (from Mac/Windows/cPanel's "Terminal"), it crashes and gives the following error message:
Error: The connection to the server ended in failure at {TIME} PM. (SIGKILL)
and:
Exit Code: 137
I've tried to create some kind of log file for any SIGKILL signal, but it seems like none can be made on a Namecheap server:
auditctl doesn't exist,
We can't get systemtap because no package managers are available.
Details:
uname -a : Linux [-n] 2.6.32-954.3.5.lve1.4.78.el6.x86_64 #1 SMP Thu Mar 26 08:20:27 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
I measured the time between crashes: around 6 minutes.
I don't have a very good knowledge of Linux servers and may not have included all the needed information, so please ask for any specifics!
Code
Reproducing this requires two applications running and connecting to each other over TCP, so I've made a tiny repo that also includes the PowerShell build script. link to the full repo
However, to avoid the extra click, here is the code for clientA.go:
package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Local endpoint for client A (port 2222) and remote endpoint for client B (port 3333).
    clientA, err := net.ResolveTCPAddr("tcp4", fmt.Sprintf(":%v", "2222"))
    if err != nil {
        fmt.Println(err)
        return
    }
    clientB, err := net.ResolveTCPAddr("tcp4", fmt.Sprintf(":%v", "3333"))
    if err != nil {
        fmt.Println(err)
        return
    }
    for {
        // Dial from A's fixed local port to B's port, retrying once per second.
        clientAtoB, err := net.DialTCP("tcp4", clientA, clientB)
        if err != nil {
            fmt.Println(err)
        } else {
            defer clientAtoB.Close()
            clientAtoB.SetLinger(0)
            clientAtoB.SetNoDelay(true)
            clientAtoB.SetKeepAlive(false)
            fmt.Println("connected as Client A!")
            buffer := make([]byte, 64)
            _, err = clientAtoB.Read(buffer)
            if err != nil {
                continue
            }
        }
        time.Sleep(time.Second)
    }
}
The code for clientB.go is identical except the local and remote endpoints are swapped around:
clientBtoA, err := net.DialTCP("tcp4", clientB, clientA)
Problem
I build the same Go code for both Windows and Linux, but at runtime the applications produce different results, specifically in how TCP connections are dialed on each platform.
On Windows, when I run the two executables clientA.exe and clientB.exe (built from the build.ps1 script), I get the desired result, as seen in this screenshot:
However when I upload and execute the Linux binaries, the result is different:
ubuntu@ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ sudo chmod +x clientA clientB
ubuntu@ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ ls -la
total 10984
drwxrwxr-x 3 ubuntu ubuntu 4096 Apr 27 03:09 .
drwxrwxr-x 4 ubuntu ubuntu 4096 Apr 27 03:08 ..
drwxrwxr-x 8 ubuntu ubuntu 4096 Apr 27 03:08 .git
-rw-rw-r-- 1 ubuntu ubuntu 11255 Apr 27 03:12 A.txt
-rw-rw-r-- 1 ubuntu ubuntu 11255 Apr 27 03:12 B.txt
-rw-rw-r-- 1 ubuntu ubuntu 247 Apr 27 03:08 build.ps1
-rwxrwxr-x 1 ubuntu ubuntu 2950662 Apr 27 03:08 clientA
-rw-rw-r-- 1 ubuntu ubuntu 2642944 Apr 27 03:08 clientA.exe
-rw-rw-r-- 1 ubuntu ubuntu 718 Apr 27 03:08 clientA.go
-rwxrwxr-x 1 ubuntu ubuntu 2950662 Apr 27 03:08 clientB
-rw-rw-r-- 1 ubuntu ubuntu 2642944 Apr 27 03:08 clientB.exe
-rw-rw-r-- 1 ubuntu ubuntu 718 Apr 27 03:08 clientB.go
ubuntu@ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ ./clientA > A.txt & ./clientB > B.txt &
[1] 24914
[2] 24915
ubuntu@ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ cat A.txt
dial tcp4 :2222->:3333: connect: connection refused
ubuntu@ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$ cat B.txt
dial tcp4 :3333->:2222: connect: connection refused
ubuntu@ip-172-31-16-224:~/go/src/github.com/fanmanpro/dial-vs-listen$
I don't expect the connection refused error, since these two applications are running in the same environment: no firewalls are in effect and the permissions are identical.
How can I get the same result regardless of platform? Or why are they different in the first place?
Edit
The successful connection on Windows is not just the luck of good timing. On Windows, I can run A for 5 minutes, then when I run B, both connect successfully.
Update (2020-04-27)
After receiving feedback from Go developers, I've been told that this is likely a Linux configuration issue and not specific to Go. Other than permissions, I can't think of anything that would prevent two applications in the same environment from establishing a TCP connection like this. (This low-level Linux stuff isn't really my forte.)
Why this doesn't work on Linux is quite obvious: both A and B are clients, each connecting to a counterpart that needs to be listening. On Linux (or UNIX), when you run ClientA it tries to dial ClientB's address and port; if no process is already listening on that address and port to accept the connection at that moment, ClientA immediately fails with a connection refused error (this is not entirely true, but it is most of the time; see my EDIT at the end of the answer).
On Windows, under the hood Golang uses the ConnectEx API (for the tcp, tcp4 and tcp6 protocols), which is for connection-oriented sockets. This API behaves differently from the Linux connect API: if ConnectEx cannot connect immediately, it returns the error code ERROR_IO_PENDING, and behind the scenes the OS waits/retries until the connection is accepted and established (or it gives up and marks it definitively failed) and then notifies the caller back; this is called overlapped I/O.
Relevant part of MSDN ConnectEx documentation:
Connection-oriented sockets are often unable to complete their connection immediately, and therefore the operation is initiated and the function immediately returns with the ERROR_IO_PENDING or WSA_IO_PENDING error. When the connect operation completes and success or failure is achieved, status is reported using the completion notification mechanism indicated in lpOverlapped.
Now, what happens in your case on Windows is that you try to ConnectEx from both sides and the OS connects those sockets for you. This only works if the other side connects within a certain period. If you reasonably increase the time.Sleep interval in both clients (e.g. to 17 and 28 seconds), you can see that even on Windows they will have a hard time connecting anymore.
The answer to your question is that your code, as written, depends on the OS-specific behavior of TCP dialing in Golang on Windows and is not portable. To make your software portable across all platforms supported by Golang, you probably want to change the logic so that both ClientA and ClientB listen for incoming connections and also periodically try to connect to the opposite side.
EDIT: I'm not saying your code cannot work on Linux at all. It actually relies on a rare connection mode called TCP simultaneous connect, where two processes can be connected without either of them listening. Both dialing sides send their SYN at the same time, so each side responds with SYN+ACK and then ACK to complete the 3-way handshake and establish the connection. That requires very precise timing and synchronization of the connect calls in both clients. Both sides would connect if TCP simultaneous connect is allowed by the Linux kernel and that synchronization between the connects is achieved (hardly done by just running both clients by hand or from the same script; even simulating it within the same process and thread is not that easy).
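Here is a minimal sketch of the listen-plus-dial approach suggested above (ports and timings are only examples, both peers run in one process just to keep the demo self-contained, and real code would also have to decide what to do when both directions connect at once):

package main

import (
    "fmt"
    "net"
    "time"
)

// peer keeps listening on localAddr while periodically dialing remoteAddr,
// so whichever side starts first simply waits for the other.
func peer(name, localAddr, remoteAddr string) {
    ln, err := net.Listen("tcp4", localAddr)
    if err != nil {
        fmt.Println(name, "listen:", err)
        return
    }

    // Accept connections initiated by the other side.
    go func() {
        for {
            conn, err := ln.Accept()
            if err != nil {
                return
            }
            fmt.Println(name, "accepted connection from", conn.RemoteAddr())
            conn.Close()
        }
    }()

    // Periodically try to dial out as well.
    for {
        conn, err := net.DialTimeout("tcp4", remoteAddr, 2*time.Second)
        if err == nil {
            fmt.Println(name, "dialed out to", conn.RemoteAddr())
            conn.Close()
        }
        time.Sleep(time.Second)
    }
}

func main() {
    go peer("A", "127.0.0.1:2222", "127.0.0.1:3333")
    peer("B", "127.0.0.1:3333", "127.0.0.1:2222")
}

With this shape, startup order no longer matters, and it behaves the same on Windows and Linux because it never relies on simultaneous connect.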
Slackware OS, trying to set up fetchmail
I have coded this .fetchmailrc file:
set daemon 600 # fetch mail every 600 seconds (10 minutes)
set logfile /root/fetchmail.log
poll 10.200.***.** protocol POP3
user "bob" password "bob" is "bob" here preconnect "date>>/root/fetchmail.log"
ssl
no rewrite
keep
It worked before, but now it is failing to retrieve mail. I checked the fetchmail.log file and I get this error:
Thu Nov 5 10:15:32 GMT 2015
fetchmail: connection errors for this poll:
name 0: connection to 10.200.***.**:pop3s [10.200.***.**/995] failed: Connection refused.
fetchmail: POP3 connection to 10.200.***.** failed: Connection refused
fetchmail: Query status=2 (SOCKET)
I've restarted the daemons and killed the process, but no progress.
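Since "Connection refused" means the TCP connection itself was rejected before SSL or POP3 ever came into play, it seems worth checking whether anything is listening on port 995 independently of fetchmail. A minimal sketch in Go (the host name is just a placeholder for the real server):

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Placeholder address: substitute the real POP3-over-SSL server here.
    addr := "pop.example.com:995"

    conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
    if err != nil {
        // "connection refused" means the host answered but nothing is listening
        // on 995; a timeout points more towards a firewall or routing problem.
        fmt.Println("cannot reach", addr, "->", err)
        return
    }
    defer conn.Close()
    fmt.Println("TCP connection to", addr, "succeeded; something is listening on 995")
}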
I had exactly the same problem on Mageia 5 Linux. Apparently, I solved it by redoing the network configuration, which Mageia can do with a single click on the relevant Configure button in the Network Center window.
I did not touch my .fetchmailrc file.
I am developing software on an embedded system (512 MB RAM). I'm using Redis as a stand-in for shared memory between processes inside a Django application.
We are talking about 150 values, stored every second, coming from a MODBUS device. They all have the same key and their expire time is 10 minutes.
After some hours of operation (typically a day), Redis stops functioning due to memory problems. Can someone help me out?
output of ps aux | grep redis
redis 1934 1.9 2.2 76216 8400 ? Ssl 07:49 10:37 /usr/bin/redis-server 127.0.0.1:6379
redis.info
# Server
redis_version:2.8.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:7cc01333adfd1c61
redis_mode:standalone
os:Linux 3.4.79+ armv7l
arch_bits:32
multiplexing_api:epoll
gcc_version:4.6.3
process_id:1934
run_id:be418b5a05b6670bb4bff9c73cc7126589d6b5c8
tcp_port:6379
uptime_in_seconds:33584
uptime_in_days:0
hz:10
lru_clock:858206
config_file:/etc/redis/redis.conf
# Clients
connected_clients:138
client_longest_output_list:54
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:6893800
used_memory_human:6.57M
used_memory_rss:6152192
used_memory_peak:47902480
used_memory_peak_human:45.68M
used_memory_lua:25600
mem_fragmentation_ratio:0.89
mem_allocator:jemalloc-3.0.0
# Persistence
loading:0
rdb_changes_since_last_save:1370469
rdb_bgsave_in_progress:0
rdb_last_save_time:1434611837
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
# Stats
total_connections_received:155
total_commands_processed:3207775
instantaneous_ops_per_sec:55
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:297
evicted_keys:0
keyspace_hits:2141758
keyspace_misses:12495
pubsub_channels:6
pubsub_patterns:0
latest_fork_usec:0
# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:295.85
used_cpu_user:333.85
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
# Keyspace
db0:keys=2,expires=2,avg_ttl=3185003
db1:keys=937,expires=242,avg_ttl=585637
snippet of redis-server.log
[1969] 18 Jun 22:17:21.819 # Server started, Redis version 2.8.6
[1969] 18 Jun 22:17:21.919 * DB loaded from disk: 0.100 seconds
[1969] 18 Jun 22:17:21.919 * The server is now ready to accept connections on port 6379
[1969] 18 Jun 22:17:21.919 * The server is now ready to accept connections at /var/run/redis/redis.sock
[1969] 19 Jun 09:16:50.444 # Client addr=127.0.0.1:38745 fd=9 name= age=39516 idle=4330 flags=N db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=13724 oll=415 omem=8509160 events=rw cmd=subscribe scheduled to be closed ASAP for overcoming of output buffer limits.
[1969] 19 Jun 09:20:54.056 # Client addr=127.0.0.1:38759 fd=14 name= age=4713 idle=4331 flags=N db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=13688 oll=415 omem=8509160 events=rw cmd=subscribe scheduled to be closed ASAP for overcoming of output buffer limits.
[1941] 19 Jun 09:17:17.134 # Unable to set the max number of files limit to 10032 (Operation not permitted), setting the max clients configuration to 3984.
Some lines from redis-cli monitor, in case someone finds them useful. Since the same keys are being rewritten over and over again, the high amount of memory used and all those file descriptors puzzle me.
http://pastebin.com/rQqThUHF
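From the log snippet above, the two clients "scheduled to be closed ASAP for overcoming of output buffer limits" are pub/sub subscribers, so the redis.conf knobs that seem relevant are the memory cap and the pub/sub output buffer limit. A hedged sketch (the values are illustrative only, not something I have validated on this board):

# Cap total memory and evict keys that already have a TTL once the cap is reached
maxmemory 64mb
maxmemory-policy volatile-ttl

# Disconnect pub/sub clients whose output buffer grows past these limits
# (hard limit / soft limit / soft-limit seconds)
client-output-buffer-limit pubsub 8mb 2mb 60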
We are encountering clock drift issues with our MongoDB replica set running on AWS. This seemed to start recently, after we added additional data to the set; before then we did not really notice the issue unless the system was under heavy load. The following error is logged sporadically in the mongod.log file even though the system is not under load.
To test this we have isolated a set of machines with the same dataset, not in use by our web application, though the error still occurs:
2014-12-12T13:33:51.333+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Dec 12 13:32:42:c which is more than 30 seconds behind member mongo1:27017 whose most recent OpTime is 1418391230
From the above, the timestamps show that one of the MongoDB replica set members is over a minute behind. The worst we have seen is 12 minutes out of sync.
This error in turn causes replication lag, and we receive a notification about it from the MongoDB Monitoring Service, although it does correct itself.
The setup is 3 x r3.xlarge AWS Linux instances, one in each Availability Zone of the eu-west-1 region. The machines have been set up with the MongoDB-recommended settings, a RAID array, and the CloudFormation scripts provided by MongoDB. The data is around 4 GB in size.
We think the issue is related to NTP sync. By default, on the Amazon Linux AMI the ntpd service is configured to use a pool of AWS NTP servers hosted on www.pool.ntp.org.
To try to rule this out, we set up our own NTP server on AWS for the MongoDB servers to sync to. The issue still occurred, so we changed the maxpoll and minpoll settings of the ntpd service on the Mongo machines to sync the time from the NTP server every 16 seconds, but the error still occurs.
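For reference, a server line that polls every 16 seconds looks like this in ntp.conf (minpoll/maxpoll are log2 of the interval, so 4 means 2^4 = 16 seconds, versus the default range of 64 to 1024 seconds):

# poll the upstream time server every 16 seconds instead of the 64-1024 s default
server time-server.domain.com iburst minpoll 4 maxpoll 4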
We also increased the MongoDB oplog size to see if that would make any difference, but it didn't.
Does anyone else encounter this type of issue? Is there something we are missing?
Cheers,
Colin.
ps -ef |grep ntp;
mongodb1
ntp 5163 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 15865 15839 0 09:31 pts/2 00:00:00 grep ntp
mongodb2
ntp 4834 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 19056 19029 0 09:31 pts/0 00:00:00 grep ntp
mongodb3
ntp 5795 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 26199 26173 0 09:31 pts/0 00:00:00 grep ntp
cat /etc/ntp.conf;
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).
driftfile /var/lib/ntp/drift
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1
# Hosts on local network are less restricted.
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.amazon.pool.ntp.org iburst dynamic
#server 1.amazon.pool.ntp.org iburst dynamic
#server 2.amazon.pool.ntp.org iburst dynamic
#server 3.amazon.pool.ntp.org iburst dynamic
server time-server.domain.com iburst
#broadcast 192.168.1.255 autokey # broadcast server
#broadcastclient # broadcast client
#broadcast 224.0.1.1 autokey # multicast server
#multicastclient 224.0.1.1 # multicast client
#manycastserver 239.255.254.254 # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
# Enable public key cryptography.
#crypto
includefile /etc/ntp/crypto/pw
# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys
# Specify the key identifiers which are trusted.
#trustedkey 4 8 42
# Specify the key identifier to use with the ntpdc utility.
#requestkey 8
# Specify the key identifier to use with the ntpq utility.
#controlkey 8
# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
# Enable additional logging.
logconfig =clockall =peerall =sysall =syncall
# Listen only on the primary network interface.
interface listen eth0
interface ignore ipv6
ntpq -npcrv;
remote refid st t when poll reach delay offset jitter
==============================================================================
*172.31.14.137 91.*.*.* 3 u 557 1024 377 1.121 -0.264 0.161
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p5#1.2349-o Sat Mar 23 00:37:31 UTC 2013 (1)",
processor="x86_64", system="Linux/3.14.23-22.44.amzn1.x86_64", leap=00,
stratum=4, precision=-23, rootdelay=23.597, rootdisp=109.962,
refid=172.31.14.137,
reftime=d83a757a.175b5fa1 Tue, Dec 16 2014 9:10:18.091,
clock=d83a77a7.82431efa Tue, Dec 16 2014 9:19:35.508, peer=27361,
tc=10, mintc=3, offset=-0.264, frequency=-13.994, sys_jitter=0.000,
clk_jitter=0.358, clk_wander=0.053
After upgrading to MongoDB 3 with the WiredTiger storage engine, we no longer see this issue.