I am trying to run a second node on a different processor, either an ARM or a second x86_64. I have a DomMgr running on one x86_64 and am attempting to start a node on either another x86_64 or an ARM using nodeBooter. The DevMgr starts and registers with the DomMgr, but when it starts the GPP device it logs "Requesting IDM CHANNEL IDM_Channel" and then immediately "terminate called after throwing an instance of 'CORBA::OBJECT_NOT_EXIST'". The DomMgr printed to the console that "Domain Channel: IDM_Channel created". Is it supposed to register that channel in the NameService? If not, why does the remote DevMgr get an invalid object reference when it tries to get it?
I did not realize I could clarify my question by editing it to add new findings.
I'll do that from now on.
By using ORBtraceLevel on the remote DevMgr I found that I had different problems
on my remote x86-based DevMgr and my ARM-based one, even though the normal error
messages were the same. The x86 case was simply that my exported DevMgr DCD
used the same name and id as one running locally on the Domain. Once I fixed that,
I had no problem with the x86-based remote DevMgr starting its GPP device and
registering.
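A collision like this can be caught before launching a node by comparing the identifying attributes of the two DCD files. The following is only a minimal sketch: it assumes the id and name are attributes on the root element of each DeviceManager.dcd.xml, and the helper names and sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

def dcd_identity(xml_text):
    """Return (id, name) from the root element of a DCD document."""
    root = ET.fromstring(xml_text)
    return root.get("id"), root.get("name")

def dcd_conflicts(local_xml, remote_xml):
    """List which identifying attributes the two DCDs have in common."""
    local_id, local_name = dcd_identity(local_xml)
    remote_id, remote_name = dcd_identity(remote_xml)
    conflicts = []
    if local_id == remote_id:
        conflicts.append("id")
    if local_name == remote_name:
        conflicts.append("name")
    return conflicts

# Hypothetical DCD snippets, trimmed down to the root element only.
LOCAL = '<deviceconfiguration id="DCE:aaaa" name="DevMgr_local"/>'
COPIED = '<deviceconfiguration id="DCE:aaaa" name="DevMgr_local"/>'
FIXED = '<deviceconfiguration id="DCE:bbbb" name="DevMgr_remote"/>'

print(dcd_conflicts(LOCAL, COPIED))  # both attributes collide
print(dcd_conflicts(LOCAL, FIXED))   # nothing collides
```

An empty list means the exported node can register alongside the local one without reusing its identity.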
But this is NOT the problem for the ARM-based case. With traceLevel=10 I started
DevMgr on both my x86 successfully and my ARM and compared the outputs. First I
should mention that my ARM is running Ubuntu 16.04 on a RaspberryPi 3. The cpu
is 64-bit but no distro for either Ubuntu or CentOS is available as 64-bit so
the OS is 32-bit Ubuntu for now. I know that REDHAWK 2.0 says it now supports only
64-bit CentOS, so perhaps that is the problem, although I was able to build REDHAWK
with no trouble and most of it works fine. But the trace does show two warnings:
WARN Device_impl:172 - Cannot set allocation implementation: Property ### is
of type 'CORBA::Long' (not 'int')
which do not show in the x86 case and I believe are due to the different sizes of int.
If I do not start an Event Service on the domain, these same warnings show but I am
able to start the GPP fine and run waveforms. So I do not know if this is related to
my OBJECT_NOT_EXIST error in GPP or not, but thought I should mention it.
Trace shows one successful
Creating ref to remote: REDHAWK.DEV.IDM.Channel
target id :IDL:omg.org/CosEventChannelAdmin/EventChannel:1.0
most derived id:
Adding root/Files<3> (activating) to object table.
but on the second case it immediately shows
Adding root<3> (activating) to object table.
followed by
throw OBJECT_NOT_EXIST from GIOP_C.cc:281 (NO,OBJECT_NOT_EXIST_NoMatch)
throw OBJECT_NOT_EXIST from omniOrbRef.cc:829 (NO,OBJECT_NOT_EXIST_NoMatch)
and then GPP terminates with signal 6.
The successful x86 trace shows the same Creating ref and Adding root<3> but then
has
Creating ref to remote: root/REDHAWK_DEV.IDM_Channel <...>
Can this be related to 32-bit vs 64-bit, or why else would this happen only on the
ARM-based GPP?
Note that iptables accepts any traffic from my subdomain on the x86 machines and is not
running at all on the ARM. There are a lot of successful connections, including queries
with nameclt, so this is not (as far as I can tell) a network connection issue.
What version of REDHAWK are you running? What OS? Can you provide a list of all the omni rpms you have installed on your machine?
It sounds like something is misconfigured on your system, perhaps iptables or SELinux? Let's walk through a quick example showing the minimum configuration and running processes needed for a multi-node system. If this does not clear things up, I'd suggest rerunning the domain and device managers with TRACE-level debugging enabled and examining the output for any anomalies, or temporarily disabling SELinux and iptables to rule them out as issues.
I'll use a REDHAWK 2.0.1 docker image as a tool to walk through the example. The installation steps used to build this image can be found here.
First we'll drop into a REDHAWK 2.0.1 environment with nothing running and label this container as our domain manager.
[youssef#axios(0) docker-redhawk]$docker run -it --name=domainMgr axios/redhawk:2.0
Let's confirm that almost nothing is running on this container
[redhawk#ce4df2ff20e4 ~]$ ps -ef
UID PID PPID C STIME TTY TIME CMD
redhawk 1 0 0 12:55 ? 00:00:00 /bin/bash -l
redhawk 27 1 0 12:57 ? 00:00:00 ps -ef
Let's take a look at the current omniORB configuration file. This will be the box on which we run omniNames, omniEvents, and the domain manager.
[redhawk#ce4df2ff20e4 ~]$ cat /etc/omniORB.cfg
InitRef = NameService=corbaname::127.0.0.1:2809
supportBootstrapAgent = 1
InitRef = EventService=corbaloc::127.0.0.1:11169/omniEvents
Since this is the machine we are running omniNames and omniEvents on, the loopback address (127.0.0.1) is fine. Other machines, however, will need to reference this machine either by its hostname (domainMgr) or its IP address, so we can note its IP now.
[redhawk#ce4df2ff20e4 ~]$ ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:0E
inet addr:172.17.0.14 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:acff:fe11:e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:6 errors:0 dropped:0 overruns:0 frame:0
TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:468 (468.0 b) TX bytes:558 (558.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Note that it has only a single interface, so we do not need to specify an endPoint. However, specifying the unix socket endpoint would provide a performance boost for any locally running components.
We can now start up omniNames, omniEvents, and the domain manager, and after each step see what is running. The "extra operand" output on omniNames is expected on newer versions of CentOS 6 and is an issue with the omniNames init script.
[redhawk#ce4df2ff20e4 ~]$ sudo service omniNames start
Starting omniNames: /usr/bin/dirname: extra operand `2>&1'
Try `/usr/bin/dirname --help' for more information.
[ OK ]
[redhawk#ce4df2ff20e4 ~]$ ps -ef
UID PID PPID C STIME TTY TIME CMD
redhawk 1 0 0 12:55 ? 00:00:00 /bin/bash -l
omniORB 50 1 0 13:01 ? 00:00:00 /usr/bin/omniNames -start -always -logdir /var/log/omniORB/ -errlog /var/log/omniORB/error.log
redhawk 53 1 0 13:01 ? 00:00:00 ps -ef
[redhawk#ce4df2ff20e4 ~]$ sudo service omniEvents start
Starting omniEvents [ OK ]
[redhawk#ce4df2ff20e4 ~]$ ps -ef
UID PID PPID C STIME TTY TIME CMD
redhawk 1 0 0 12:55 ? 00:00:00 /bin/bash -l
omniORB 50 1 0 13:01 ? 00:00:00 /usr/bin/omniNames -start -always -logdir /var/log/omniORB/ -errlog /var/log/omniORB/error.log
root 69 1 0 13:01 ? 00:00:00 /usr/sbin/omniEvents -P /var/run/omniEvents.pid -l /var/lib/omniEvents -p 11169
redhawk 79 1 0 13:01 ? 00:00:00 ps -ef
I'm going to start up the domain manager in the foreground and grab the output of ps -ef via a "docker exec domainMgr ps -ef" in a different terminal
[redhawk#ce4df2ff20e4 ~]$ nodeBooter -D
2016-06-22 13:03:21 INFO DomainManager:257 - Loading DEFAULT logging configuration.
2016-06-22 13:03:21 INFO DomainManager:368 - Starting Domain Manager
2016-06-22 13:03:21 INFO DomainManager_impl:208 - Domain Channel: ODM_Channel created.
2016-06-22 13:03:21 INFO DomainManager_impl:225 - Domain Channel: IDM_Channel created.
2016-06-22 13:03:21 INFO DomainManager:455 - Starting ORB!
[youssef#axios(0) docker-redhawk]$docker exec domainMgr ps -ef
UID PID PPID C STIME TTY TIME CMD
redhawk 1 0 0 12:55 ? 00:00:00 /bin/bash -l
omniORB 50 1 0 13:01 ? 00:00:00 /usr/bin/omniNames -start -always -logdir /var/log/omniORB/ -errlog /var/log/omniORB/error.log
root 69 1 0 13:01 ? 00:00:00 /usr/sbin/omniEvents -P /var/run/omniEvents.pid -l /var/lib/omniEvents -p 11169
redhawk 80 1 0 13:03 ? 00:00:00 DomainManager DEBUG_LEVEL 3 DMD_FILE /domain/DomainManager.dmd.xml DOMAIN_NAME REDHAWK_DEV FORCE_REBIND false PERSISTENCE true SDRROOT /var/redhawk/sdr
redhawk 93 0 1 13:03 ? 00:00:00 ps -ef
So we can see that we have omniNames, omniEvents, and the DomainManager binaries running. Time to move on to a new node for the device manager.
In a new terminal I create a new container and call it deviceManager
[youssef#axios(0) docker-redhawk]$docker run -it --name=deviceManager axios/redhawk:2.0
Confirm nothing is really running, then take a look at the omniORB configuration file.
[redhawk#765ce325f145 ~]$ ps -ef
UID PID PPID C STIME TTY TIME CMD
redhawk 1 0 0 13:05 ? 00:00:00 /bin/bash -l
redhawk 28 1 0 13:06 ? 00:00:00 ps -ef
[redhawk#765ce325f145 ~]$ cat /etc/omniORB.cfg
InitRef = NameService=corbaname::127.0.0.1:2809
supportBootstrapAgent = 1
InitRef = EventService=corbaloc::127.0.0.1:11169/omniEvents
We need to change the NameService and EventService addresses to point to either our domain manager's hostname (domainMgr) or its IP address (172.17.0.14). I will go with the IP address.
[redhawk#765ce325f145 ~]$ sudo sed -i 's,127.0.0.1,172.17.0.14,g' /etc/omniORB.cfg
[redhawk#765ce325f145 ~]$ cat /etc/omniORB.cfg
InitRef = NameService=corbaname::172.17.0.14:2809
supportBootstrapAgent = 1
InitRef = EventService=corbaloc::172.17.0.14:11169/omniEvents
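If the configuration is managed from a script rather than edited by hand, the same substitution can be restricted to the InitRef lines so that no unrelated 127.0.0.1 entry is touched. This is a minimal sketch, not part of REDHAWK or omniORB; the function name retarget_initrefs is hypothetical and it assumes the InitRef lines follow the corbaname::host:port and corbaloc::host:port shape shown above.

```python
import re

# Matches the fixed prefix of an InitRef line up to the "::" before the host.
_INITREF = re.compile(
    r"^(InitRef\s*=\s*\w+=corba(?:name|loc)::)[^:/\s]+",
    re.MULTILINE,
)

def retarget_initrefs(cfg_text, new_host):
    """Point every InitRef URI at new_host, leaving ports and paths as they are."""
    return _INITREF.sub(lambda m: m.group(1) + new_host, cfg_text)

ORIGINAL = (
    "InitRef = NameService=corbaname::127.0.0.1:2809\n"
    "supportBootstrapAgent = 1\n"
    "InitRef = EventService=corbaloc::127.0.0.1:11169/omniEvents\n"
)

print(retarget_initrefs(ORIGINAL, "172.17.0.14"))
```

Only the host part changes; the ports (2809, 11169) and the /omniEvents object key are kept.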
We can confirm this worked using nameclt list, which shows the omniNames entries for the event channel factory and the domain.
[redhawk#765ce325f145 ~]$ nameclt list
EventChannelFactory
REDHAWK_DEV/
Finally we can start up the device manager and inspect the running processes in a new shell via "docker exec deviceManager ps -ef"
[redhawk#765ce325f145 ~]$ nodeBooter -d /var/redhawk/sdr/dev/nodes/DevMgr_12ef887a9000/DeviceManager.dcd.xml
2016-06-22 13:09:09 INFO DeviceManager:446 - Starting Device Manager with /nodes/DevMgr_12ef887a9000/DeviceManager.dcd.xml
2016-06-22 13:09:09 INFO DeviceManager_impl:367 - Connecting to Domain Manager REDHAWK_DEV/REDHAWK_DEV
2016-06-22 13:09:09 INFO DeviceManager:494 - Starting ORB!
2016-06-22 13:09:09 INFO Device:995 - DEV-ID:DCE:c5029226-ce70-48d9-9533-e025fb9c2a34 Requesting IDM CHANNEL IDM_Channel
2016-06-22 13:09:09 INFO redhawk::events::Manager:573 - PUBLISHER - Channel:IDM_Channel Reg-Id21f4e766-c5c6-4c5b-8974-337736e71f87 RESOURCE:DCE:c5029226-ce70-48d9-9533-e025fb9c2a34
2016-06-22 13:09:09 INFO DeviceManager_impl:1865 - Registering device GPP_12ef887a9000 on Device Manager DevMgr_12ef887a9000
2016-06-22 13:09:09 INFO DeviceManager_impl:1907 - Device LABEL: GPP_12ef887a9000 SPD loaded: GPP' - 'DCE:4e20362c-4442-4656-af6d-aedaaf13b275
2016-06-22 13:09:09 INFO GPP:658 - initialize()
2016-06-22 13:09:09 INFO redhawk::events::Manager:626 - SUBSCRIBER - Channel:ODM_Channel Reg-Id0d18c1f4-71bf-42c2-9a2d-416f16af9fcf resource:DCE:c5029226-ce70-48d9-9533-e025fb9c2a34
2016-06-22 13:09:09 INFO GPP_i:679 - Component Output Redirection is DISABLED.
2016-06-22 13:09:09 INFO GPP:1611 - Affinity Disable State, disabled=1
2016-06-22 13:09:09 INFO GPP:1613 - Disabling affinity processing requests.
2016-06-22 13:09:09 INFO GPP_i:571 - SOCKET CPUS USER SYSTEM IDLE
2016-06-22 13:09:09 INFO GPP_i:577 - 0 8 0.00 0.00 0.00
2016-06-22 13:09:09 INFO GPP:616 - initialize CPU Montior --- wl size 8
2016-06-22 13:09:10 INFO GPP_i:602 - initializeNetworkMonitor: Adding interface (docker0)
2016-06-22 13:09:10 INFO GPP_i:602 - initializeNetworkMonitor: Adding interface (em1)
2016-06-22 13:09:10 INFO GPP_i:602 - initializeNetworkMonitor: Adding interface (lo)
2016-06-22 13:09:10 INFO GPP_i:602 - initializeNetworkMonitor: Adding interface (tun0)
2016-06-22 13:09:10 INFO GPP_i:602 - initializeNetworkMonitor: Adding interface (vboxnet0)
2016-06-22 13:09:10 INFO GPP_i:602 - initializeNetworkMonitor: Adding interface (veth70de860)
2016-06-22 13:09:10 INFO GPP_i:602 - initializeNetworkMonitor: Adding interface (vethd0227d6)
2016-06-22 13:09:10 INFO DeviceManager_impl:2087 - Registering device GPP_12ef887a9000 on Domain Manager
[youssef#axios(0) docker-redhawk]$docker exec deviceManager ps -ef
UID PID PPID C STIME TTY TIME CMD
redhawk 1 0 0 13:05 ? 00:00:00 /bin/bash -l
redhawk 35 1 0 13:09 ? 00:00:00 DeviceManager DCD_FILE /nodes/DevMgr_12ef887a9000/DeviceManager.dcd.xml DEBUG_LEVEL 3 DOMAIN_NAME REDHAWK_DEV SDRCACHE /var/redhawk/sdr/dev SDRROOT /var/redhawk/sdr
redhawk 40 35 1 13:09 ? 00:00:00 /var/redhawk/sdr/dev/devices/GPP/cpp/GPP PROFILE_NAME /devices/GPP/GPP.spd.xml DEVICE_ID DCE:c5029226-ce70-48d9-9533-e025fb9c2a34 DEVICE_LABEL GPP_12ef887a9000 DEBUG_LEVEL 3 DOM_PATH REDHAWK_DEV/DevMgr_12ef887a9000 DCE:218e612c-71a7-4a73-92b6-bf70959aec45 False DCE:3bf07b37-0c00-4e2a-8275-52bd4e391f07 1.0 DCE:442d5014-2284-4f46-86ae-ce17e0749da0 0 DCE:4e416acc-3144-47eb-9e38-97f1d24f7700 DCE:5a41c2d3-5b68-4530-b0c4-ae98c26c77ec 0 DEVICE_MGR_IOR IOR:010000001900000049444c3a43462f4465766963654d616e616765723a312e3000000000010000000000000070000000010102000c0000003137322e31372e302e313500a49e00001c000000ff4465766963654d616e61676572fef58d6a570100002300000000000200000000000000080000000100000000545441010000001c00000001000000010001000100000001000105090101000100000009010100
redhawk 398 0 0 13:09 ? 00:00:00 ps -ef
So we've successfully spun up two machines on the same network with unique IP addresses, designated one as the domain manager, omniNames, and omniEvents server and the other as a Device Manager / GPP node. At this point, we could connect to the domain manager either via the IDE or through a python interface and launch waveforms; we would expect these waveforms to launch on the sole device manager node.
I just installed CentOS 7 [Kernel 3.10.0-514] on my USB stick.
The operating system works fine, but I had some problems with my Broadcom 43227 wireless card.
I downloaded the driver, patched it, and changed the code a bit according to the instructions here: https://wiki.centos.org/HowTos/Laptops/Wireless/Broadcom and after many attempts it finally compiled. After I loaded the driver module into the kernel, the LED turned on.
Now I need to connect to my Wi-Fi.
What I am trying to do:
Get wireless interface name using iw dev:
phy#0
Interface wlp2s0
Scan to find WiFi Network using iw wlp2s0 scan | grep SSID
SSID: MyNetworkName
Generate a WPA/WPA2 configuration file using wpa_passphrase MyNetworkName >> /etc/wpa_supplicant.conf
MyNetworkPassword
Connect to WPA/WPA2 WiFi network using wpa_supplicant -B -D wext -i wlp2s0 -c /etc/wpa_supplicant.conf
Successfully initialized wpa_supplicant
[and in some cases after a few minutes]
ERROR #wl_cfg80211_scan: WLC_SCAN error (-22)
Get an IP using dhclient wlp2s0
But nothing happens
Ping command: Name or service not known
If I run wpa_supplicant without -B I get some repeating errors:
Device or resource busy
wlp2s0: Failed to initiate AP scan
wlp2s0: Trying to associate with [MAC] (SSID='MyNetName' freq=2462 MHz)
Operation not supported
wlp2s0: Association request to the driver failed
....
If I add -D nl80211 to the wpa_supplicant call I get the same errors, without "Device or resource busy".
What am I doing wrong?
My Dear All the Greatest Lords,
Some expert listed the details of connecting to a wireless network as,
This is a step-by-step guide for connecting to a WPA/WPA2 WiFi network via the Linux command line interface. The tools are:
wpa_supplicant
iw
ip
ping
iw is the basic tool for WiFi network-related tasks, such as finding the WiFi device name, and scanning access points. wpa_supplicant is the wireless tool for connecting to a WPA/WPA2 network. ip is used for enabling/disabling devices, and finding out general network interface information.
The steps for connecting to a WPA/WPA2 network are:
Find out the wireless device name.
$ /sbin/iw dev
phy#0
Interface wlan0
ifindex 3
type managed
The above output showed that the system has 1 physical WiFi card, designated as phy#0. The device name is wlan0. The type specifies the operation mode of the wireless device. managed means the device is a WiFi station or client that connects to an access point.
Check that the wireless device is up.
$ ip link show wlan0
3: wlan0: (BROADCAST,MULTICAST) mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
link/ether 74:e5:43:a1:ce:65 brd ff:ff:ff:ff:ff:ff
Look for the word "UP" inside the brackets in the first line of the output.
In the above example, wlan0 is not UP. Execute the following command to bring it up:
$ sudo ip link set wlan0 up
[sudo] password for peter:
Note: you need root privilege for the above operation.
If you run the show link command again, you can tell that wlan0 is now UP.
$ ip link show wlan0
3: wlan0: (NO-CARRIER,BROADCAST,MULTICAST,UP) mtu 1500 qdisc mq state DOWN mode DEFAULT qlen 1000
link/ether 74:e5:43:a1:ce:65 brd ff:ff:ff:ff:ff:ff
Check the connection status.
$ /sbin/iw wlan0 link
Not connected.
The above output shows that you are not connected to any network.
Scan to find out what WiFi network(s) are detected
$ sudo /sbin/iw wlan0 scan
BSS 00:14:d1:9c:1f:c8 (on wlan0)
... snipped ...
freq: 2412
SSID: stanford
RSN: * Version: 1
* Group cipher: CCMP
* Pairwise ciphers: CCMP
* Authentication suites: PSK
* Capabilities: (0x0000)
... snipped ...
The 2 important pieces of information from the above are the SSID and the security protocol (WPA/WPA2 vs WEP). The SSID from the above example is stanford. The security protocol is RSN, also commonly referred to as WPA2. The security protocol is important because it determines what tool you use to connect to the network.
Connect to WPA/WPA2 WiFi network.
This is a 2 step process. First, you generate a configuration file for wpa_supplicant that contains the pre-shared key ("passphrase") for the WiFi network.
$ sudo -s
[sudo] password for peter:
$ wpa_passphrase stanford >> /etc/wpa_supplicant.conf
...type in the passphrase and hit enter...
wpa_passphrase takes the SSID as the single argument. You must type in the passphrase for the WiFi network stanford after you run the command. Using that information, wpa_passphrase will output the necessary configuration statements to the standard output. Those statements are appended to the wpa_supplicant configuration file located at /etc/wpa_supplicant.conf.
Note: you need root privilege to write to /etc/wpa_supplicant.conf.
$ cat /etc/wpa_supplicant.conf
# reading passphrase from stdin
network={
ssid="stanford"
#psk="testtest"
psk=4dfe1c985520d26a13e932bf0acb1d4580461dd854ed79ad1a88ec221a802061
}
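The psk value in that file is not random: for WPA/WPA2-Personal it is derived from the passphrase and the SSID with PBKDF2-HMAC-SHA1, 4096 iterations, and a 32-byte output. The sketch below mirrors that derivation using only the Python standard library; the function name wpa_psk is an assumption made for illustration and is not part of wpa_supplicant.

```python
import hashlib

def wpa_psk(ssid, passphrase):
    """Derive the 256-bit WPA pre-shared key as a 64-character hex string."""
    if not 8 <= len(passphrase) <= 63:
        raise ValueError("WPA passphrase must be 8 to 63 characters long")
    key = hashlib.pbkdf2_hmac(
        "sha1",               # PRF: HMAC-SHA1
        passphrase.encode(),  # password
        ssid.encode(),        # the SSID acts as the salt
        4096,                 # iteration count
        32,                   # key length in bytes
    )
    return key.hex()

# SSID and passphrase taken from the sample configuration file above.
print(wpa_psk("stanford", "testtest"))
```

Because the SSID is the salt, the same passphrase yields a different key on a differently named network, which is why a configuration file generated for one SSID cannot be reused for another.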
The second step is to run wpa_supplicant with the new configuration file.
$ sudo wpa_supplicant -B -D wext -i wlan0 -c /etc/wpa_supplicant.conf
-B means run wpa_supplicant in the background.
-D specifies the wireless driver. wext is the generic driver.
-c specifies the path for the configuration file.
Use the iw command to verify that you are indeed connected to the SSID.
$ /sbin/iw wlan0 link
Connected to 00:14:d1:9c:1f:c8 (on wlan0)
SSID: stanford
freq: 2412
RX: 63825 bytes (471 packets)
TX: 1344 bytes (12 packets)
signal: -27 dBm
tx bitrate: 6.5 MBit/s MCS 0
bss flags: short-slot-time
dtim period: 0
beacon int: 100
Obtain IP address by DHCP
$ sudo dhclient wlan0
Use the ip command to verify the IP address assigned by DHCP. The IP address is 192.168.1.113 from below.
$ ip addr show wlan0
3: wlan0: mtu 1500 qdisc mq state UP qlen 1000
link/ether 74:e5:43:a1:ce:65 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.113/24 brd 192.168.1.255 scope global wlan0
inet6 fe80::76e5:43ff:fea1:ce65/64 scope link
valid_lft forever preferred_lft forever
Add default routing rule.
The last configuration step is to make sure that you have the proper routing rules.
$ ip route show
192.168.1.0/24 dev wlan0 proto kernel scope link src 192.168.1.113
The above routing table contains only 1 rule which redirects all traffic destined for the local subnet (192.168.1.x) to the wlan0 interface. You may want to add a default routing rule to pass all other traffic through wlan0 as well.
$ sudo ip route add default via 192.168.1.254 dev wlan0
$ ip route show
default via 192.168.1.254 dev wlan0
192.168.1.0/24 dev wlan0 proto kernel scope link src 192.168.1.113
ping external ip address to test connectivity
$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_req=1 ttl=48 time=135 ms
64 bytes from 8.8.8.8: icmp_req=2 ttl=48 time=135 ms
64 bytes from 8.8.8.8: icmp_req=3 ttl=48 time=134 ms
^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 134.575/134.972/135.241/0.414 ms
The above series of steps is a very verbose explanation of how to connect a WPA/WPA2 WiFi network. Some steps can be skipped as you connect to the same access point for a second time. For instance, you already know the WiFi device name, and the configuration file is already set up for the network. The process needs to be tailored according to your situation.
Having followed the above tutorial thoroughly, I still failed to connect to the wireless router.
(working as root)
......
#wpa_supplicant -B -i wlan0 -c /etc/wpa_supplicant.conf -D wext
#iw wlan0 link
Not connected.
Even disabling WPA authentication using
iwconfig wlan0 essid XXXXXXXXXXXXX
was to no avail.
But the GNOME wireless tray is functioning (can select, connect, disconnect, etc.)
Thank you a lot in advance.
The latest wpa_supplicant is able to do the whole job itself.
The wpa_supplicant options you wrote seem OK to me.
But please check the file "/etc/wpa_supplicant.conf": make sure it is readable and well formed (ssid, wpa, password correct...).
We are encountering clock drift issues with our MongoDB replica set running on AWS. This seemed to start recently, after we added additional data to the set; before then we did not really notice the issue unless the system was under heavy load. The following error is logged sporadically in the mongod.log file even when the system is not under load.
To test this we isolated a set of machines with the same dataset that are not in use by our web application, but the error still occurs:
2014-12-12T13:33:51.333+0000 [rsBackgroundSync] changing sync target
because current sync target's most recent OpTime is Dec 12 13:32:42:c
which is more than 30 seconds behind member mongo1:27017 whose most
recent OpTime is 1418391230
From the above, the timestamps show that one of the MongoDB replica set members is over a minute behind. The worst we have seen is 12 minutes out of sync.
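The gap can be worked out directly from the two OpTimes in the log line, one printed as a date and the other as a Unix epoch. This is a small sketch under two assumptions: both values are UTC (the log timestamp carries +0000), and the trailing ":c" on the first OpTime is a counter that can be ignored. The function name is hypothetical.

```python
from datetime import datetime, timezone

def optime_lag_seconds(target_optime, member_epoch, year):
    """Seconds by which the sync target trails the more recent member."""
    # The log omits the year, so it is supplied separately.
    target = datetime.strptime(f"{year} {target_optime}", "%Y %b %d %H:%M:%S")
    target = target.replace(tzinfo=timezone.utc)
    member = datetime.fromtimestamp(member_epoch, tz=timezone.utc)
    return int((member - target).total_seconds())

# Values copied from the log line above, with the ":c" suffix dropped.
lag = optime_lag_seconds("Dec 12 13:32:42", 1418391230, 2014)
print(lag)  # 68
```

The epoch value 1418391230 corresponds to 13:33:50 UTC on the same day, so the sync target is 68 seconds behind, past the 30 second threshold mentioned in the message.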
This error in turn causes replication lag, and we receive notifications about it from the Mongo Monitoring Service, although it does correct itself.
The setup is 3 x r3.xlarge AWS Linux instances, 1 in each availability zone of the EU-West-1 region. The machines have been set up using the Mongo recommended settings with a RAID array and the CloudFormation scripts provided by Mongo. The data is around 4GB in size.
We think the issue is related to NTP sync. By default on the AWS Linux Amazon Machine Image, the ntpd service is configured to use a pool of AWS NTP servers hosted on www.pool.ntp.org.
To try to rule this out we set up our own NTP server on AWS that the MongoDB servers could sync to. The issue still occurred, so we changed the maxpoll and minpoll time for the ntpd service on the mongo machines to sync the time every 16 seconds from the NTP server, but the error is still occurring.
We increased the MongoDB OpLog size as well to see if that would make any difference, but it didn’t.
Has anyone else encountered this type of issue? Is there something we are missing?
Cheers,
Colin.
ps -ef |grep ntp;
mongodb1
ntp 5163 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 15865 15839 0 09:31 pts/2 00:00:00 grep ntp
mongodb2
ntp 4834 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 19056 19029 0 09:31 pts/0 00:00:00 grep ntp
mongodb3
ntp 5795 1 0 Dec11 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ec2-user 26199 26173 0 09:31 pts/0 00:00:00 grep ntp
cat /etc/ntp.conf;
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).
driftfile /var/lib/ntp/drift
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1
# Hosts on local network are less restricted.
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.amazon.pool.ntp.org iburst dynamic
#server 1.amazon.pool.ntp.org iburst dynamic
#server 2.amazon.pool.ntp.org iburst dynamic
#server 3.amazon.pool.ntp.org iburst dynamic
server time-server.domain.com iburst
#broadcast 192.168.1.255 autokey # broadcast server
#broadcastclient # broadcast client
#broadcast 224.0.1.1 autokey # multicast server
#multicastclient 224.0.1.1 # multicast client
#manycastserver 239.255.254.254 # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
# Enable public key cryptography.
#crypto
includefile /etc/ntp/crypto/pw
# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys
# Specify the key identifiers which are trusted.
#trustedkey 4 8 42
# Specify the key identifier to use with the ntpdc utility.
#requestkey 8
# Specify the key identifier to use with the ntpq utility.
#controlkey 8
# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
# Enable additional logging.
logconfig =clockall =peerall =sysall =syncall
# Listen only on the primary network interface.
interface listen eth0
interface ignore ipv6
ntpq -npcrv;
remote refid st t when poll reach delay offset jitter
==============================================================================
*172.31.14.137 91.*.*.* 3 u 557 1024 377 1.121 -0.264 0.161
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p5@1.2349-o Sat Mar 23 00:37:31 UTC 2013 (1)",
processor="x86_64", system="Linux/3.14.23-22.44.amzn1.x86_64", leap=00,
stratum=4, precision=-23, rootdelay=23.597, rootdisp=109.962,
refid=172.31.14.137,
reftime=d83a757a.175b5fa1 Tue, Dec 16 2014 9:10:18.091,
clock=d83a77a7.82431efa Tue, Dec 16 2014 9:19:35.508, peer=27361,
tc=10, mintc=3, offset=-0.264, frequency=-13.994, sys_jitter=0.000,
clk_jitter=0.358, clk_wander=0.053
After upgrading to MongoDB 3 using the WiredTiger storage engine we do not see this issue any more.
I am running 64-bit Ubuntu on top of Oracle VM installed on the Windows 7 operating system.
This is the error message I am getting:
stop: Rejected send message, 1 matched rules; type="method-call", sender=":1.10" (uid=1000) pid=1084 comm="stop networking"; interface="com.ubuntu.Upstart0_6.job" member="Stop" "error_name"="unset" requested_reply="0" destination="com.ubuntu.Upstart" (uid= 0 pid=1 comm="/sbin/init")
start: Rejected send message, 1 matched rules; type="method-call", sender=":1.11" (uid=1000) pid=1085 comm="stop networking"; interface="com.ubuntu.Upstart0_6.job" member="Stop" "error_name"="unset" requested_reply="0" destination="com.ubuntu.Upstart" (uid= 0 pid=1 comm="/sbin/init")
Is this because of a setting I misconfigured?
What exactly could it be?
I was trying to restart the networking service to get the eth1 interface working.
Instead of restarting the networking service, I just ran the following command at the command line.
$sudo ifup -v eth1
And the interface started showing up.
This is not really a solution, but it worked for me. I still don't know what was wrong above.
I have tried to create a ceph filesystem on a single host, for testing purposes, with the following conf file:
[global]
log file = /var/log/ceph/$name.log
pid file = /var/run/ceph/$name.pid
[mon]
mon data = /srv/ceph/mon/$name
[mon.mio]
host = penny
mon addr = 127.0.0.1:6789
[mds]
[mds.mio]
host = penny
[osd]
osd data = /srv/ceph/osd/$name
osd journal = /srv/ceph/osd/$name/journal
osd journal size = 1000 ; journal size, in megabytes
[osd.0]
host = penny
devs = /dev/loop1
/dev/loop1 is formatted with XFS and is actually a 500 MB file (although that shouldn't matter much). Everything works pretty much OK, and health shows:
sudo ceph -s
2013-12-12 21:14:44.387240 pg v111: 198 pgs: 198 active+clean; 8730 bytes data, 79237 MB used, 20133 MB / 102 GB avail
2013-12-12 21:14:44.388542 mds e6: 1/1/1 up {0=mio=up:active}
2013-12-12 21:14:44.388605 osd e3: 1 osds: 1 up, 1 in
2013-12-12 21:14:44.388738 log 2013-12-12 21:14:32.739326 osd.0 127.0.0.1:6801/8834 181 : [INF] 2.30 scrub ok
2013-12-12 21:14:44.388922 mon e1: 1 mons at {mio=127.0.0.1:6789/0}
but when I try to mount the filesystem
sudo mount -t ceph penny:/ /mnt/ceph
mount error 5 = Input/output error
Usual answers point to ceph-mds not running, but it's actually working:
root 8771 0.0 0.0 574092 4376 ? Ssl 20:43 0:00 /usr/bin/ceph-mds -i mio -c /etc/ceph/ceph.conf
In fact, I previously managed to make it work by following these instructions verbatim: http://blog.bob.sh/2012/02/basic-ceph-storage-kvm-virtualisation.html but when I tried again I got the same problem. Any idea what might have failed?
Update: as indicated by the comment, dmesg shows a problem
[ 6715.712211] libceph: mon0 [::1]:6789 connection failed
[ 6725.728230] libceph: mon1 127.0.1.1:6789 connection failed
Try to use 127.0.0.1. It looks like the kernel is resolving the hostname, but 127.0.1.1 is weird, and maybe it isn't responding to IPv6 loopback.
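One way to see where those two addresses might come from is to check what the hosts file maps the hostname to, since Debian-style systems commonly list the machine's own name against 127.0.1.1. The sketch below parses hosts-format text; the function name and the sample contents are hypothetical, not taken from the asker's machine.

```python
def addresses_for(hostname, hosts_text):
    """Return every address that hosts-format text maps to hostname."""
    found = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if hostname in names:
            found.append(address)
    return found

# Hypothetical /etc/hosts contents for a host named "penny".
SAMPLE = """\
127.0.0.1   localhost
127.0.1.1   penny      # added by the installer
::1         localhost ip6-localhost ip6-loopback penny
"""

print(addresses_for("penny", SAMPLE))
```

On a real machine the same function could be fed the contents of /etc/hosts. If the monitor only listens on 127.0.0.1:6789, as in the conf file above, neither address returned for the hostname would reach it.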