RabbitMQ won't cluster (nxdomain) - DNS

I want to set up 2 RabbitMQ servers to work as a cluster.
When trying to run
rabbitmqctl join_cluster rabbit@my_rabbit_1.my.domain.name on my_rabbit_1
I get: unable to connect to epmd (port 4369) on my_rabbit_2.my.domain.name: nxdomain (non-existing domain)
I use rabbitmq:latest (Debian), .erlang.cookie is the same on both nodes, and the hosts resolve fine: I can ping in both directions, and nmap -6 -p 4369 my_rabbit_2.my.domain.name returns 4369/tcp open epmd.
EDIT:
tcpdump shows that while resolving the hostname, RabbitMQ (or epmd) does not perform both types of DNS query (AAAA for IPv6 and A for IPv4), but only the A (IPv4) query, which fails repeatedly with nxdomain since no IPv4 address is available. It never tries an AAAA query, except when running a command like rabbitmqctl -n rabbit@local.machine.domain.name: then it issues an AAAA query and succeeds. Hence the problem. How do I solve that?

Finally found a solution that worked for me. The Erlang documentation says that -proto_dist specifies the protocol for Erlang distribution, and that it defaults to inet_tcp (TCP over IPv4). So in an IPv6-only environment you have to pass the -proto_dist inet6_tcp flag to erl.
This can be done by adding the following lines to your rabbitmq-env.conf (see the RabbitMQ configuration docs):
# For rabbitmq-server
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-proto_dist inet6_tcp"
# For rabbitmqctl
RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp"
Note that rabbitmqctl and rabbitmq-server use different erl settings: I was unable to create the cluster via rabbitmqctl join_cluster rabbit@host.in.my.domain without the RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp" setting. It should not be necessary in production mode. Also note that the RabbitMQ configuration docs advise against using this setting except for debugging.
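Putting it together, a minimal sketch of the sequence (node names taken from the question; the stop_app / join_cluster / start_app steps are the standard clustering flow rather than anything specific to IPv6):
# /etc/rabbitmq/rabbitmq-env.conf on both nodes
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-proto_dist inet6_tcp"
RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp"
# restart the broker so the new Erlang flags take effect
service rabbitmq-server restart
# on the node that should join the cluster
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@my_rabbit_1.my.domain.name
rabbitmqctl start_app
rabbitmqctl cluster_status    # both nodes should now be listed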

unable to connect to epmd (port 4369) on my_rabbit_2.my.domain.name: nxdomain (non-existing domain)
This is an error raised when the RabbitMQ server is running on a hostname other than the one you think it is running on, or when the hostname doesn't resolve to what you think it does.
Amusingly enough, I had this exact same issue last night when one instance in our cluster failed, came back on a new hostname, and somehow corrupted its internal authentication store.
Without the exact DNS entries etc. for your setup, all I can offer is general troubleshooting steps.
See this StackOverflow question for a resolution that may help you - in particular the answer by Kishor Pawar.
Are you sure you configured RabbitMQ to listen on IPv6? Is there a reason you can't bind it to IPv4 as well on 127.0.0.1 for management operations?
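For general troubleshooting, here is a sketch of checks that show what the node resolves and whether epmd is reachable (hostnames taken from the question):
# what the local machine thinks its fully qualified name is
hostname -f
# does the peer resolve over IPv6 and/or IPv4?
getent ahostsv6 my_rabbit_2.my.domain.name
getent ahostsv4 my_rabbit_2.my.domain.name
# which Erlang node names has the local epmd registered?
epmd -names
# is the peer's epmd port reachable?
nmap -6 -p 4369 my_rabbit_2.my.domain.name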

Related

RabbitMQ: changing hostname while preserving RabbitMQ artifacts and messages

This question is regarding RabbitMQ configuration.
I hope this question is appropriate for the StackOverflow forum; please point me to the right forum if it isn't.
My problem statement is that I need to change the hostname of a Linux server from "thishost" to "thathost".
The host "thishost" has RabbitMQ installed on it with a ton of artifacts and messages.
I need to be able to preserve all the RabbitMQ artifacts, such as queues, exchanges, and also messages, when the hostname changes to "thathost".
I am considering a configuration change so that RabbitMQ still sees the old hostname (thishost) despite the name change in Linux.
To ensure that the RabbitMQ hostname remains the same, I peg it to the original hostname by configuring the following two parameters in the RabbitMQ configuration file:
/etc/rabbitmq/rabbitmq-env.conf
...
HOSTNAME=thishost
NODENAME=rabbit@thishost
Having made this change in the RabbitMQ config, I changed the Linux hostname to "thathost" and tried to start the RabbitMQ service.
The RabbitMQ service now refuses to start, and the error messages are as follows:
service rabbitmq-server start
Job for rabbitmq-server.service failed because the control process exited with error code.
See "systemctl status rabbitmq-server.service" and "journalctl -xe" for details.
journalctl -xe
Nov 30 11:20:07 ubuntula1 systemd[1]: Failed to start RabbitMQ Messaging Server.
Nov 30 11:20:18 ubuntula1 systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.
The log file under /var/log/rabbitmq shows the following error:
ERROR: epmd error for host thishost: nxdomain (non-existing domain)
Any thoughts on:
how to fix the RabbitMQ config
any alternative way of making RabbitMQ agnostic to the hostname
whether there is a better way to preserve the RabbitMQ artifacts across a hostname change
Please note I tried the following:
exporting/importing artifacts using rabbitmqctl export_definitions/import_definitions
storing and loading messages using rabbitio
However, as I mentioned, I have a ton of artifacts and messages, and the effort involved in that approach makes it error prone, so I am searching for a less laborious approach.
Thanks much folks
Going by the error message in the log file, "epmd error for host thishost: nxdomain (non-existing domain)",
I stumbled upon this post: How to resolve ERROR: epmd error for host nxdomain (non-existing domain)?
While it is not directly relevant, it does provide the tip that an /etc/hosts entry is needed to map the old hostname to the same IP address.
With an alias for the old hostname added in /etc/hosts, my problem was solved :-)
So to sum it up, if you want to change the hostname of your Linux host, you need to do two things to keep your artifacts from becoming unusable after the hostname change:
Change the RabbitMQ configuration as already described:
/etc/rabbitmq/rabbitmq-env.conf
...
HOSTNAME=thishost
Add an alias in /etc/hosts mapping the old hostname to the IP address, in addition to the new hostname, as follows:
/etc/hosts
...
a.b.c.d thathost thishost
That solved my problem, and RabbitMQ starts fine with all existing artifacts intact after the hostname change.
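Putting both changes together, a rough sketch of the whole procedure (paths and names as above; the restart command assumes systemd, and a.b.c.d stands for the host's real IP address):
# 1. pin RabbitMQ to the old hostname
# /etc/rabbitmq/rabbitmq-env.conf
HOSTNAME=thishost
NODENAME=rabbit@thishost
# 2. keep the old name resolvable after the OS hostname change
# /etc/hosts
a.b.c.d thathost thishost
# 3. restart and confirm the artifacts are still there
systemctl restart rabbitmq-server
rabbitmqctl list_queues
rabbitmqctl list_exchanges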

One openshift-origin worker node won't resolve cluster.local records, causing ImagePullBackOff

We have set up an OKD 3.11 cluster with 100+ nodes. Everything was working fine, but then a worker node stopped resolving the registry service's internal URL. This causes new pods scheduled to that node to fail with an ImagePullBackOff error.
Failed to pull image "docker-registry.default.svc:5000/app-name/app-name:latest": rpc error: code = Unknown desc = Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 10.*.*.71:53: server misbehaving
We tried running nslookup on the worker node, with the following results.
This doesn't work (while it works on other nodes):
[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local
Server: 10.*.*.71
Address: 10.*.*.71#53
** server can't find docker-registry.default.svc.cluster.local: SERVFAIL
This works just fine:
[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local 127.0.0.1
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: docker-registry.default.svc.cluster.local
Address: 172.*.*.212
Adding server=/cluster.local/172.30.0.1 to the dnsmasq conf file /etc/dnsmasq.d/origin-upstream-dns.conf works as a workaround, but I can't find what is causing this.
I have tried adding -q to the dnsmasq service's ExecStart, and it shows that dnsmasq won't query the OpenShift DNS running locally at 127.0.0.1:53.
The dnsmasq config and resolv.conf are in order on the node.
I have tried restarting dnsmasq/NetworkManager/Docker, and I have tried respawning the ovs/sdn pods, but still no help.
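For reference, applying the workaround mentioned above amounts to something like this (a sketch; 172.30.0.1 is the upstream from my conf line, and the restart assumes systemd):
# /etc/dnsmasq.d/origin-upstream-dns.conf
server=/cluster.local/172.30.0.1
# reload dnsmasq and re-run the lookup that was failing
systemctl restart dnsmasq
nslookup docker-registry.default.svc.cluster.local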
Found some documented evidence that dnsmasq can behave like this.
Some Red Hat articles suggest that a long-running dnsmasq service may misbehave and stop resolving names. Similar cases have been reported for OpenShift environments as well.
The links below suggest that restarting the service solves the problem for some time and that the issue may then resurface. As stated earlier, in my case a service restart didn't help, but the oldest remedy in IT worked: rebooting the node solved the problem.
Reference:
https://access.redhat.com/solutions/3393141
https://bugzilla.redhat.com/show_bug.cgi?id=1560489

ECONNREFUSED when using node with nano and couchdb

I was using Node.js + nano + CouchDB for my application successfully up until today. For some reason, all of a sudden I'm getting ECONNREFUSED when I try to run my application. If I query the database using a web browser or a different application (a Java application), it works fine. I'm uncertain why it stopped working only in this scenario. I've been researching for the past 2 days and can't find any help. I believe this might have something to do with too many open connections, but that's a little outside my realm of knowledge. Can anyone provide me with any insight on debugging this issue, or any direction I could go in? I should mention this CouchDB lives on Iris Couch.
Add more information about the stack that you're using, but basically the server machine doesn't want to allow the connection. Also try running your app with DEBUG=*; nano will then log almost everything via console.log.
E.g. change the start command in package.json to DEBUG=* node changetoyourapp.js
I faced the same issue yesterday with nodejitsu/iriscouch. The issue disappeared after some restarts.
Check the version of your Node against the Node version expected by nano. It is possible that nano does not work with Node > 16.
This comes down to Node v18 now preferring an IPv6 address over an IPv4 address if both exist for the same hostname.
i.e. if your /etc/hosts contains entries like this:
127.0.0.1 localhost
::1 localhost
Node v16 will say that "localhost" resolves to 127.0.0.1, whereas Node v18 will say "localhost" resolves to ::1, the IPv6 equivalent. As CouchDB doesn't listen on an IPv6 port by default, a connection to ::1 will be refused.
Solutions:
Use 127.0.0.1 instead of localhost in your URLs.
Use a domain name that resolves unambiguously to an IPv4 address e.g. 127.0.0.1 my.pretend.host in your /etc/hosts file.
Revert to Node v16, which preferred IPv4 addresses in its DNS lookup.
Make CouchDB bind to an IPv6 address by setting bind_address = ::1 in the CouchDB ini file. You can then do curl 'http://USER:PASS@[::1]:5984/'.
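A quick way to see which address your Node version picks for localhost, plus a flag-based workaround (a sketch; the --dns-result-order flag is available on recent Node releases and is not mentioned in the linked issue, and app.js stands in for your own entry point):
# print what dns.lookup() returns for "localhost" on this Node version
# (with the hosts entries above, Node 16 typically prints "127.0.0.1 4", Node 18 prints "::1 6")
node -e 'require("dns").lookup("localhost", (err, addr, family) => console.log(addr, family))'
# force the old IPv4-first ordering without touching URLs or /etc/hosts
node --dns-result-order=ipv4first app.js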
See
https://github.com/apache/couchdb-nano/issues/313#issuecomment-1321760360

The controller is not available at localhost - JBoss 7.1.1.Final

When I run jboss-cli.sh,
I get this message:
[root bin]# sh jboss-cli.sh
You are disconnected at the moment. Type 'connect' to connect to the server or 'help' for the list of supported commands.
[disconnected /] connect localhost
The controller is not available at localhost:9999
[disconnected /] connect
The controller is not available at localhost:9999
[disconnected /] connect localhost:9999
The controller is not available at localhost:9999
[disconnected /]
Also, I have another installation of JBoss 5 GA; I hope that is not interfering,
although it is completely shut down for now.
The native management interface is :9999 in standalone.xml.
Please throw some light on this issue.
EDIT:
When I stop my service with "service jboss stop",
I get this message:
[root bin]# *** JBossAS process (7302) received KILL signal ***
grep: /var/run/jboss-as/jboss-as-standalone.pid: No such file or directory
I don't know how to check whether the server is listening on port 9999 or not.
A few more details:
[root bin]# netstat -anp |grep 9999
tcp 0 0 127.0.0.1:9999 0.0.0.0:* LISTEN 7931/java
[root bin]# netstat -anp |grep 8080
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 7931/java
The JBoss process ID and the process ID holding these ports are the same.
This question has two issues.
First, I had provided a debugging parameter in the startup script: if you see 8787, that means you have passed a debugging argument somewhere.
Second, and the most important one: the controller is not available at localhost or at the IP address.
Check whether you have used a port offset, as it increments all the ports by the number you set the offset to.
Suppose the port offset is 2.
Then try connect localhost:10001, i.e. port 9999+2.
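For reference, the offset itself lives on the socket binding group in standalone.xml, so a sketch of checking and matching it looks like this (assuming the stock AS 7.1 standalone.xml layout):
<!-- standalone.xml: a non-zero port-offset shifts 9999 along with everything else -->
<socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
    <socket-binding name="management-native" interface="management" port="${jboss.management.native.port:9999}"/>
    ...
</socket-binding-group>
# start with an offset of 2, then point the CLI at 9999+2
./standalone.sh -Djboss.socket.binding.port-offset=2
./jboss-cli.sh --controller=localhost:10001 --connect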
On my production server it sometimes does not work with localhost but works with the IP address.
In that case, try connect IPADDRESS:9999
OR
try connect 127.0.0.1:9999
Also check in the firewall whether port 9999 (or whatever it becomes with the port offset) is open; if the port is not open in the firewall, you get this error.
I asked this question 6 months back, and the above checks have always solved the problem.
This is probably because you have changed your binding configuration and JBoss does not bind to 127.0.0.1.
If your JBoss instance is not binding to 127.0.0.1, you may use the --controller option as follows:
./jboss-cli.sh --controller=YOUR_IP:9999
Use netstat -anp | grep 9999 to find out whether port 9999 is in use and by which process ID. You could also check the host.xml used by the controller to configure the proper native port.
In host.xml, you should find the default port:
<native-interface security-realm="ManagementRealm">
    <socket interface="management" port="${jboss.management.native.port:9999}"/>
</native-interface>
./jboss-cli.sh --controller=localhost:9999 --connect
You opened the debug port with jboss-cli.sh. Either you activated it in jboss-cli.sh:
# Sample JPDA settings for remote socket debugging
# JAVA_OPTS="$JAVA_OPTS -Xrunjdwp:transport=dt_socket,address=8787,server=y,suspend=n"
or you set JAVA_OPTS with such an option in your environment. See:
echo $JAVA_OPTS
I guess you did this for two JBoss processes, and you get a port conflict. See:
netstat -nap | grep 8787
I recently faced this issue, and the root cause I found was completely different from those listed above. For another project I had switched from JDK 1.7 to 1.8. Boom! The error started coming up... I spent a lot of time figuring out why before finally realizing I had changed my JDK version.
It might be because JBoss 7 doesn't work with Java 1.8, of which I have limited knowledge, but this might prove useful in some cases.
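If you suspect the JDK switch, a quick check and a way to pin the server back (a sketch; the JAVA_HOME path below is only an example, and the setting goes in bin/standalone.conf):
# which Java the shell (and therefore standalone.sh) picks up
java -version
echo $JAVA_HOME
# bin/standalone.conf - pin JBoss AS 7.1.1 to a 1.7 JDK
# (example path; use the location of your own JDK 1.7 install)
JAVA_HOME="/usr/lib/jvm/java-1.7.0"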

Linux Debian SSH connection to another machine has delay after network settings change

Hi StackOverflow members,
I have an issue with an SSH connection from my Debian 7 system to a remote OpenSSH server located on the same network. It looks like there is some network configuration problem, but I can't find where it lies. The two Debian machines are connected via a switch that is NOT connected to a router, so the two machines have no internet connection.
A-Debian 7
IP: 192.168.1.2
MASK: 255.255.255.0
GW: 192.168.1.1
B-Debian 7
IP: 192.168.1.3
MASK: 255.255.255.0
GW: 192.168.1.1
With that configuration the ssh command prompts me for a password in less than a second. But with the following network configuration I get the password prompt only after a 10+ second delay:
A-Debian 7
IP: 10.10.1.83
MASK: 255.255.255.128
GW: 10.10.1.1
B-Debian 7
IP: 10.10.1.82
MASK: 255.255.255.128
GW: 10.10.1.1
The SSH connection from server A -> B runs on a custom port 1111 with both configs.
Machine B also runs a web server on port 8080, which has no delays with either network configuration.
Thank you in advance for any clues or tips on how to solve this problem.
SOLVED: Removing the gateway parameter "GW: 10.10.1.1" from the network settings solved the problem.
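For reference, the working setup for machine A then looks roughly like this in /etc/network/interfaces (a sketch assuming the interface is eth0; the point is simply that the gateway line is left out):
# /etc/network/interfaces on A
auto eth0
iface eth0 inet static
    address 10.10.1.83
    netmask 255.255.255.128
    # no "gateway 10.10.1.1" line - removing it is what fixed the delay
# apply the change
ifdown eth0 && ifup eth0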
The usual culprits here are IPv6 and DNS lookups.
SSH might try to connect via IPv6 first and only fall back to IPv4 after a timeout, which shows up as exactly this kind of delay. You can see whether IPv6 is enabled with
cat /proc/sys/net/ipv6/conf/eth0/disable_ipv6
To disable:
echo 1 > /proc/sys/net/ipv6/conf/eth0/disable_ipv6
The second culprit is DNS; my guess is that DNS lookups don't work correctly with the second configuration. Try host www.google.com to test this theory.
If that also has a delay, you need to fix your DNS setup.
If that's not it, check the rest of your networking parameters: Gateway, cables, etc.
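If DNS does turn out to be the culprit and you cannot fix the resolver itself, a common server-side mitigation (not part of the answer above) is to stop sshd from doing reverse DNS lookups on connecting clients:
# /etc/ssh/sshd_config on the server (machine B)
UseDNS no
# reload sshd to apply (Debian)
service ssh restart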
Start by pinging the other host. Is that fast and reliable?
Next, try a remote login (ssh, telnet). Note that you can give telnet a port to connect to, so if you have a DB server running, you can still use telnet to connect to it. It will print an error, but it lets you test the TCP/IP connection without any extra sources of error.
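Concretely, with the addresses and the custom SSH port from the question, that boils down to (a sketch):
# from A: basic reachability
ping -c 3 10.10.1.82
# raw TCP connection to the SSH port (1111) from the question;
# this exercises the network path without any extra protocol on top
telnet 10.10.1.82 1111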
