AWS - EC2 - MongoDB replica set time sync issue - NTP - replication lag - linux

We are encountering clock drift issues with our MongoDB replica set running on AWS. This seemed to start recently, after we added additional data to the set; before then we did not really notice the issue unless the system was under heavy load. The following error is now logged sporadically in the mongod.log file, even when the system is not under load.
To test this we isolated a set of machines with the same dataset that are not in use by our web application, but the error still occurs:
2014-12-12T13:33:51.333+0000 [rsBackgroundSync] changing sync target
because current sync target's most recent OpTime is Dec 12 13:32:42:c
which is more than 30 seconds behind member mongo1:27017 whose most
recent OpTime is 1418391230
The timestamps above show that one of the MongoDB replica set members is over a minute behind. The worst we have seen is 12 minutes out of sync.
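As a quick sanity check on the figures in that log line, the two OpTimes can be compared directly. The sketch below is only illustrative: it assumes both values are in UTC and that the year is 2014 (taken from the log line's own timestamp), and the variable names are made up for the example. Keep in mind that an OpTime records the last operation a member has applied, so the gap measures how far behind the sync target is in the oplog, which is not necessarily the same thing as a wall-clock offset.

```python
from datetime import datetime, timezone

# Assumption: the year is not part of the printed OpTime, so it is taken
# from the timestamp at the start of the log line (2014-12-12).
LOG_YEAR = 2014

# "current sync target's most recent OpTime is Dec 12 13:32:42"
target_optime = datetime.strptime(
    f"{LOG_YEAR} Dec 12 13:32:42", "%Y %b %d %H:%M:%S"
).replace(tzinfo=timezone.utc)

# "member mongo1:27017 whose most recent OpTime is 1418391230" (epoch seconds)
member_optime = datetime.fromtimestamp(1418391230, tz=timezone.utc)

lag_seconds = int((member_optime - target_optime).total_seconds())
print(member_optime.isoformat())  # 2014-12-12T13:33:50+00:00
print(lag_seconds)                # 68
```

A gap of 68 seconds is consistent with the "more than 30 seconds behind" wording in the message.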
This error in turn causes replication lag, and we receive a notification about it from the Mongo Monitoring Service, although the lag does correct itself.
The setup is 3 x r3.xlarge Amazon Linux instances, one in each availability zone of the EU-West-1 region. The machines were set up using the MongoDB recommended settings, with a RAID array and the CloudFormation scripts provided by MongoDB. The data is around 4GB in size.
We think the issue is related to NTP sync. By default, on the Amazon Linux Amazon Machine Image the ntpd service is configured to use a pool of Amazon NTP servers under pool.ntp.org (the amazon.pool.ntp.org entries).
To try to rule this out we set up our own NTP server on AWS for the MongoDB servers to sync to. The issue still occurred, so we changed the minpoll and maxpoll settings for the ntpd service on the Mongo machines to poll the NTP server every 16 seconds, but the error is still occurring.
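For readers unfamiliar with these settings: ntpd expresses minpoll and maxpoll as exponents of two, so a 16 second interval corresponds to a value of 4. The sketch below shows the arithmetic. The helper names are hypothetical, and the generated server line is an assumption about what such a config entry might look like; it does not appear in the configuration posted here.

```python
def poll_interval_seconds(exponent: int) -> int:
    """Convert an ntpd minpoll/maxpoll exponent into seconds (2 ** exponent)."""
    return 2 ** exponent


def server_line(host: str, minpoll: int, maxpoll: int) -> str:
    """Build an illustrative ntp.conf server line with explicit poll bounds."""
    return f"server {host} iburst minpoll {minpoll} maxpoll {maxpoll}"


# 4 -> 16 s (the interval described above); 6 and 10 are the ntpd defaults.
for exp in (4, 6, 10):
    print(exp, poll_interval_seconds(exp))

# Hypothetical entry, reusing the placeholder host name from the posted config.
print(server_line("time-server.domain.com", 4, 4))
```

With the default bounds, ntpd backs off towards a 1024 second interval once the clock is stable.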
We increased the MongoDB OpLog size as well to see if that would make any difference but it didn’t.
Does anyone else encounter this type of issue? Is there something we are missing?
Cheers,
Colin.
ps -ef |grep ntp;
mongodb1
ntp 5163 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 15865 15839 0 09:31 pts/2 00:00:00 grep ntp
mongodb2
ntp 4834 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 19056 19029 0 09:31 pts/0 00:00:00 grep ntp
mongodb3
ntp 5795 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 26199 26173 0 09:31 pts/0 00:00:00 grep ntp
cat /etc/ntp.conf;
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).
driftfile /var/lib/ntp/drift
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1
# Hosts on local network are less restricted.
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.amazon.pool.ntp.org iburst dynamic
#server 1.amazon.pool.ntp.org iburst dynamic
#server 2.amazon.pool.ntp.org iburst dynamic
#server 3.amazon.pool.ntp.org iburst dynamic
server time-server.domain.com iburst
#broadcast 192.168.1.255 autokey # broadcast server
#broadcastclient # broadcast client
#broadcast 224.0.1.1 autokey # multicast server
#multicastclient 224.0.1.1 # multicast client
#manycastserver 239.255.254.254 # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
# Enable public key cryptography.
#crypto
includefile /etc/ntp/crypto/pw
# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys
# Specify the key identifiers which are trusted.
#trustedkey 4 8 42
# Specify the key identifier to use with the ntpdc utility.
#requestkey 8
# Specify the key identifier to use with the ntpq utility.
#controlkey 8
# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
# Enable additional logging.
logconfig =clockall =peerall =sysall =syncall
# Listen only on the primary network interface.
interface listen eth0
interface ignore ipv6
ntpq -npcrv;
remote refid st t when poll reach delay offset jitter
==============================================================================
*172.31.14.137 91.*.*.* 3 u 557 1024 377 1.121 -0.264 0.161
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p5#1.2349-o Sat Mar 23 00:37:31 UTC 2013 (1)",
processor="x86_64", system="Linux/3.14.23-22.44.amzn1.x86_64", leap=00,
stratum=4, precision=-23, rootdelay=23.597, rootdisp=109.962,
refid=172.31.14.137,
reftime=d83a757a.175b5fa1 Tue, Dec 16 2014 9:10:18.091,
clock=d83a77a7.82431efa Tue, Dec 16 2014 9:19:35.508, peer=27361,
tc=10, mintc=3, offset=-0.264, frequency=-13.994, sys_jitter=0.000,
clk_jitter=0.358, clk_wander=0.053
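To read the peer line from the ntpq output above programmatically, something like the following could be used. It is a minimal sketch that assumes numeric remotes (the -n flag) and the standard ten-column layout; the Peer type and parse_peer function are names invented for this example.

```python
from typing import NamedTuple


class Peer(NamedTuple):
    tally: str        # "*" marks the peer ntpd is currently synced to
    remote: str
    stratum: int
    poll_s: int
    reach: int        # 8-bit shift register, printed by ntpq in octal
    offset_ms: float
    jitter_ms: float


def parse_peer(line: str) -> Peer:
    """Parse one peer line from `ntpq -pn` (assumes a numeric remote address)."""
    tally = "" if line[0].isdigit() else line[0]
    f = line[len(tally):].split()
    # Columns: remote refid st t when poll reach delay offset jitter
    return Peer(tally, f[0], int(f[2]), int(f[5]), int(f[6], 8),
                float(f[8]), float(f[9]))


peer = parse_peer("*172.31.14.137 91.*.*.* 3 u 557 1024 377 1.121 -0.264 0.161")
print(peer.tally, peer.offset_ms, bin(peer.reach).count("1"))
```

For the line shown, the offset is -0.264 ms and all eight bits of the reach register are set (octal 377), meaning the last eight polls were answered. On these numbers the system clock looks well synchronized at the time the output was captured.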

After upgrading to MongoDB 3 using the WiredTiger storage engine we do not see this issue any more.

Related

How to confirm if TimeSync service is enabled on a RHEL 8.2 VM running in Azure?

I'm new to Linux, so I'm a bit confused about whether I have to do any best-practice time sync configuration with Azure or not.
From https://learn.microsoft.com/en-us/windows-server/networking/windows-time-service/accurate-time?redirectedfrom=MSDN#allowing-linux-to-use-hyper-v-host-time
The above link mentions: "For Linux guests running in Hyper-V, clients are typically configured to use the NTP daemon for time synchronization against NTP servers. If the Linux distribution supports the TimeSync version 4 protocol and the Linux guest has the TimeSync integration service enabled, then it will synchronize against the host time. This could lead to inconsistent time keeping if both methods are enabled."
How can I confirm this?
How can I confirm whether the TimeSync service is enabled on my RHEL 8.2 VM running in Azure?
Also, how can I confirm whether my NTP daemon is configured for time synchronization against NTP servers?
As part of my investigation I ran the following on the RHEL 8.2 VM (running in Azure).
My findings in this lab are that ntp is not configured directly: /etc/ntp.conf does not exist and (as recorded in earlier comments) the ntpq command is not found.
[user#vm-aep-dev-eastu ~]$ service ntpd status
Redirecting to /bin/systemctl status ntpd.service
Unit ntpd.service could not be found
However, chrony is active and appears to be synchronising the system clock with NTP servers.
systemctl status chronyd
● chronyd.service - NTP client/server
Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2020-07-16 08:58:39 UTC; 7h ago
Other details:
$ /sbin/lsmod | egrep -i "^hv|hyperv"
hv_utils 36864 2
hv_balloon 28672 0
hyperv_fb 20480 1
hv_netvsc 86016 0
hv_storvsc 20480 4
hid_hyperv 16384 0
hyperv_keyboard 16384 0
hv_vmbus 114688 7 hv_balloon,hv_utils,hv_netvsc,hid_hyperv,hv_storvsc,hyperv_keyboard,hyperv_fb
Thanks
From the document Time sync for Linux VMs in Azure,
On Ubuntu 19.10 and later versions, Red Hat Enterprise Linux, and
CentOS 8.x, chrony is configured to use a PTP source clock.
For more information about Red Hat and NTP, see Configure NTP.
If both chrony and VMICTimeSync sources are enabled simultaneously,
you can mark one as prefer, which sets the other source as a backup.
Because NTP services do not update the clock for large skews except
after a long period, the VMICTimeSync will recover the clock from
paused VM events far more quickly than NTP-based tools alone.
See here for more details.
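One way to see which sources chrony is configured with is to list the server, pool, peer and refclock directives in its config file (on RHEL this is /etc/chrony.conf); on a running system, chronyc sources shows the sources actually in use. The sketch below parses config text for those directives. The sample content is illustrative only and is not taken from the VM in the question; the PHC refclock line follows the form given in the Azure documentation, and the device name on a particular image may differ.

```python
# Illustrative content only; NOT read from the VM in the question.
SAMPLE_CHRONY_CONF = """\
# Use public servers from the pool.ntp.org project.
pool 2.rhel.pool.ntp.org iburst
refclock PHC /dev/ptp_hyperv poll 3 dpoll -2 offset 0
driftfile /var/lib/chrony/drift
"""

SOURCE_DIRECTIVES = ("server", "pool", "peer", "refclock")


def time_sources(conf_text: str):
    """Return (directive, arguments) pairs for every time source in the config."""
    sources = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!;%":   # blank line or chrony comment
            continue
        directive, *args = line.split()
        if directive in SOURCE_DIRECTIVES:
            sources.append((directive, " ".join(args)))
    return sources


for directive, args in time_sources(SAMPLE_CHRONY_CONF):
    print(directive, args)
```

If both a refclock PHC entry and network servers are listed, both mechanisms are configured, which is the situation the quoted documentation describes.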

Debian Linux Raspbian- Raspberry Pi time offset is 65s ahead of UTC

For some strange reason unknown to me, my RPi appears to have been set incorrectly to UTC +65s. The output I receive is the following:
sudo ntpd -gq
ntpd: time set -65.706156s
I have tried stopping and restarting ntp server (no effect).
When I check the sync servers using the following command, I do receive a ping back so it's not a case of the servers not responding, or a firewall issue:
grep -P "^server" /etc/ntp.conf
server 0.debian.pool.ntp.org iburst
server 1.debian.pool.ntp.org iburst
server 2.debian.pool.ntp.org iburst
server 3.debian.pool.ntp.org iburst
ping -c 1 0.debian.pool.ntp.org
PING 0.debian.pool.ntp.org (193.1.219.116) 56(84) bytes of data.
64 bytes from tbag.heanet.ie (193.1.219.116): icmp_req=1 ttl=51 time=18.8 ms
--- 0.debian.pool.ntp.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 18.818/18.818/18.818/0.000 ms
I'm at a loss as to how to correct this.
UPDATE:
Running the ntpq -p command yields the following info:
remote refid st t when poll reach delay offset jitter
==============================================================================
*adsl-172-10-0-1 117.70.*.110 4 u 2 64 7 0.617 -0.070 0.109
Is this the ntp server that I'm trying to sync to - because that IP belongs to CHINANET (I don't know how or why).
I also tried to manually set the RPi time, after stopping ntp service, setting the time correctly and restarting the service.
What I noticed was that the time stayed correct for a good 5 seconds before reverting to its 65s offset. So it appears that the ntp service itself is reapplying the offset.
Found the solution as described in post 6 of the link:
http://forum.openmediavault.org/index.php/Thread/13035-Raspberry-Pi-NTP-service-not-using-etc-ntp-conf/
Basically, when the RPi connects to the network, the DHCP server acts as the NTP server, and a copy of the ntp.conf file is created at /var/lib/ntp/ntp.conf.dhcp
This file overrides the default /etc/ntp.conf file, so the only way to resolve this is to delete it, stop the ntp service, perform a resync, and then start the service again.
The command for resync is:
sudo ntpdate -b pool.ntp.org
The original issue was that ntp was syncing with a CHINANET server, which caused the 65s offset. I suspect this is down to a misconfigured DHCP/NTP server on our network.
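To check which file is in effect before deleting anything, a small script along these lines may help. It is a sketch under the assumption described above, namely that the DHCP-generated file takes precedence whenever it exists; the function names are invented for the example and the default paths are the two mentioned in this answer.

```python
from pathlib import Path

DHCP_CONF = Path("/var/lib/ntp/ntp.conf.dhcp")
DEFAULT_CONF = Path("/etc/ntp.conf")


def effective_ntp_conf(dhcp_conf: Path = DHCP_CONF,
                       default_conf: Path = DEFAULT_CONF) -> Path:
    """Return the DHCP-generated config if present, else the default one."""
    return dhcp_conf if dhcp_conf.exists() else default_conf


def server_lines(conf: Path):
    """List the 'server' lines of a config file (empty if the file is missing)."""
    if not conf.exists():
        return []
    return [line.strip() for line in conf.read_text().splitlines()
            if line.strip().startswith("server")]


conf = effective_ntp_conf()
print("config in effect:", conf)
for line in server_lines(conf):
    print(line)
```

If the servers printed here differ from those in /etc/ntp.conf, the DHCP-supplied file is the one being used.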

Redis and memory

I am developing software for an embedded system (512 MB RAM). I'm using Redis in place of shared memory between processes inside a Django application.
We are talking about 150 values, stored every second, coming from a MODBUS device. They all have the same key and their expire time is 10 minutes.
After some hours of operation (typically a day), Redis stops working due to memory problems. Can someone help me out?
output of ps aux | grep redis
redis 1934 1.9 2.2 76216 8400 ? Ssl 07:49 10:37 /usr/bin/redis-server 127.0.0.1:6379
redis.info
# Server
redis_version:2.8.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:7cc01333adfd1c61
redis_mode:standalone
os:Linux 3.4.79+ armv7l
arch_bits:32
multiplexing_api:epoll
gcc_version:4.6.3
process_id:1934
run_id:be418b5a05b6670bb4bff9c73cc7126589d6b5c8
tcp_port:6379
uptime_in_seconds:33584
uptime_in_days:0
hz:10
lru_clock:858206
config_file:/etc/redis/redis.conf
# Clients
connected_clients:138
client_longest_output_list:54
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:6893800
used_memory_human:6.57M
used_memory_rss:6152192
used_memory_peak:47902480
used_memory_peak_human:45.68M
used_memory_lua:25600
mem_fragmentation_ratio:0.89
mem_allocator:jemalloc-3.0.0
# Persistence
loading:0
rdb_changes_since_last_save:1370469
rdb_bgsave_in_progress:0
rdb_last_save_time:1434611837
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
# Stats
total_connections_received:155
total_commands_processed:3207775
instantaneous_ops_per_sec:55
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:297
evicted_keys:0
keyspace_hits:2141758
keyspace_misses:12495
pubsub_channels:6
pubsub_patterns:0
latest_fork_usec:0
# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:295.85
used_cpu_user:333.85
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
# Keyspace
db0:keys=2,expires=2,avg_ttl=3185003
db1:keys=937,expires=242,avg_ttl=585637
snippet of redis-server.log
[1969] 18 Jun 22:17:21.819 # Server started, Redis version 2.8.6
[1969] 18 Jun 22:17:21.919 * DB loaded from disk: 0.100 seconds
[1969] 18 Jun 22:17:21.919 * The server is now ready to accept connections on port 6379
[1969] 18 Jun 22:17:21.919 * The server is now ready to accept connections at /var/run/redis/redis.sock
[1969] 19 Jun 09:16:50.444 # Client addr=127.0.0.1:38745 fd=9 name= age=39516 idle=4330 flags=N db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=13724 oll=415 omem=8509160 events=rw cmd=subscribe scheduled to be closed ASAP for overcoming of output buffer limits.
[1969] 19 Jun 09:20:54.056 # Client addr=127.0.0.1:38759 fd=14 name= age=4713 idle=4331 flags=N db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=13688 oll=415 omem=8509160 events=rw cmd=subscribe scheduled to be closed ASAP for overcoming of output buffer limits.
[1941] 19 Jun 09:17:17.134 # Unable to set the max number of files limit to 10032 (Operation not permitted), setting the max clients configuration to 3984.
Here are some lines from redis-cli monitor, in case someone finds them useful. Since the same keys are rewritten over and over again, the high memory usage and all those file descriptors puzzle me.
http://pastebin.com/rQqThUHF
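One detail in the server log above can be checked with a little arithmetic: the omem value of the subscriber that was disconnected. The sketch below extracts the fields from that log line and compares omem against the stock Redis default for pub/sub clients (client-output-buffer-limit pubsub 32mb 8mb 60). Whether this system uses the default limits is an assumption, since redis.conf was not posted, and the function name is made up for the example.

```python
import re

# Stock Redis default: client-output-buffer-limit pubsub 32mb 8mb 60
# (assumed here; the actual redis.conf was not posted).
PUBSUB_HARD_LIMIT = 32 * 1024 * 1024
PUBSUB_SOFT_LIMIT = 8 * 1024 * 1024

LOG_LINE = (
    "Client addr=127.0.0.1:38745 fd=9 name= age=39516 idle=4330 flags=N db=0 "
    "sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=13724 oll=415 omem=8509160 "
    "events=rw cmd=subscribe"
)


def client_fields(line: str) -> dict:
    """Split the key=value pairs of a Redis client description into a dict."""
    return dict(re.findall(r"([\w-]+)=(\S*)", line))


fields = client_fields(LOG_LINE)
omem = int(fields["omem"])
print("omem bytes:", omem)
print("over soft limit:", omem > PUBSUB_SOFT_LIMIT)
print("over hard limit:", omem > PUBSUB_HARD_LIMIT)
print("idle seconds:", int(fields["idle"]))
```

Under that assumption, the client's pending output (about 8.1 MiB) is just past the 8 MiB soft limit, and the idle value suggests the subscriber had not interacted with the server for over an hour. That would point at a subscriber that is not consuming its messages, though the log alone does not prove it.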

NTP on secondary/redundant system does not sync time from primary

I have two systems. One acts as Primary/Active, has an Internet connection, and gets its time from an NTP server. The second system is Secondary/Passive and has no connection to the external world.
Primary and Secondary are connected on a private network interface eth1.
Primary has IP 169.254.10.10 Subnet: 255.255.255.248 Broadcast: 169.254.10.15
Secondary has IP 169.254.10.11 Subnet: 255.255.255.248 Broadcast: 169.254.10.15
Primary has the following ntp.conf Configuration
driftfile /var/lib/ntp/ntp.drift
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
server 91.189.94.4
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery
restrict 127.0.0.1
restrict ::1
restrict 169.254.10.0 mask 255.255.255.248
broadcast 169.254.10.15
disable auth
broadcastclient
I sync time on Secondary with ntpdate only and do not run the ntpd daemon there.
On Secondary I run ntpdate -b -t 4 -p 4 -u 169.254.10.10 (the Primary interface IP),
and ntpd is running on Primary with the configuration above.
The time on Secondary is not updated and ntpdate reports this error:
ntpdate[3636]: no server suitable for synchronization found
Thanks
Visu
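The addressing in this question can be double-checked with Python's ipaddress module. The sketch below uses only the addresses and masks quoted above; the variable names are made up. It shows that the hosts sit in 169.254.10.8/29, while the restrict line in the posted ntp.conf describes 169.254.10.0/29, which does not contain either host.

```python
import ipaddress

# Addresses and mask exactly as given in the question.
primary = ipaddress.ip_interface("169.254.10.10/255.255.255.248")
secondary = ipaddress.ip_interface("169.254.10.11/255.255.255.248")

# Network described by: restrict 169.254.10.0 mask 255.255.255.248
restrict_net = ipaddress.ip_network("169.254.10.0/255.255.255.248")

print(primary.network)                    # 169.254.10.8/29
print(primary.network.broadcast_address)  # 169.254.10.15
print(secondary.ip in primary.network)    # True
print(secondary.ip in restrict_net)       # False
```

This mismatch means the relaxed restrict line never applies to Secondary, so its requests fall under the default restrict line instead. That default still permits time queries, so the mismatch alone may not explain the ntpdate error, but it is worth correcting to 169.254.10.8 while investigating.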

Nagios check_ntp_peer not working

I am running a virtualized (VMware) Debian (2.6.26-2-686) which I monitor through Nagios. Lately, I have been getting the following Critical error (reported by the _check_ntp_peer_ script):
NTP CRITICAL: Server not synchronized, Offset unknown
I notice that none of the lines output by the _ntpq -no_ command has a star (*):
remote refid st t when poll reach delay offset jitter
==============================================================================
200.144.121.33 193.204.114.232 2 u 1 64 1 187.298 -34742. 32.024
146.164.53.65 200.20.186.75 2 u 2 64 1 185.574 -34716. 0.001
200.160.0.8 200.160.7.186 2 u 1 64 1 186.229 -34734. 0.001
187.49.33.13 .INIT. 16 u - 64 0 0.000 0.000 0.001
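The same check can be scripted. The sketch below scans the peer lines above for a selected peer (a line starting with *) and converts the offsets of reachable peers from milliseconds to seconds. It is illustrative only and assumes the standard ntpq column layout with numeric remotes; the function names are invented.

```python
PEER_LINES = """\
200.144.121.33 193.204.114.232 2 u 1 64 1 187.298 -34742. 32.024
146.164.53.65 200.20.186.75 2 u 2 64 1 185.574 -34716. 0.001
200.160.0.8 200.160.7.186 2 u 1 64 1 186.229 -34734. 0.001
187.49.33.13 .INIT. 16 u - 64 0 0.000 0.000 0.001
""".splitlines()


def has_selected_peer(lines) -> bool:
    """True if ntpd has chosen a system peer (tally code '*')."""
    return any(line.startswith("*") for line in lines)


def reachable_offsets_seconds(lines) -> dict:
    """Map remote -> offset in seconds for peers with a non-zero reach value."""
    offsets = {}
    for line in lines:
        f = line.lstrip(" *+-#").split()
        reach = int(f[6], 8)              # reach is printed in octal
        if reach:
            offsets[f[0]] = float(f[8]) / 1000.0
    return offsets


print(has_selected_peer(PEER_LINES))
for remote, offset in reachable_offsets_seconds(PEER_LINES).items():
    print(remote, round(offset, 3))
```

The three reachable servers agree that the local clock is about 34.7 seconds off, and each has a reach of 1, meaning only a single poll had been answered when the output was taken. No peer has been selected yet, which matches the "Server not synchronized" message from the check.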
Any clue?
Here is the ntp.conf
tinker panic 0
driftfile /var/lib/ntp/ntp.drift
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
server 0.debian.pool.ntp.org iburst dynamic
server 1.debian.pool.ntp.org iburst dynamic
server 2.debian.pool.ntp.org iburst dynamic
server 3.debian.pool.ntp.org iburst dynamic
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery
restrict 127.0.0.1
restrict ::1
So, any idea of what the problem could be?
Thanks in advance,
Wilmer
I had similar problems with Ubuntu and ntp. Time was drifting off dramatically and Nagios reported NTP CRITICAL: Offset unknown.
Check the status of your VMware timesync:
#vmware-toolbox-cmd timesync status
Disable
Enable it if you notice it is disabled.
#vmware-toolbox-cmd timesync enable
Enabled
Helped in my case. May be helpful in yours too. I think it is not in accordance with vmware best practices but it works.
