[Question posted by a user on YugabyteDB Community Slack]
I have a single-node YugabyteDB 2.12.3 instance set up to use the public server address.
When I try to change the services to bind to localhost, the master service fails to start with this error:
UNKNOWN_ROLE
ERROR: Network error (yb/util/net/socket.cc:551):
Unable to get registration information for peer ([10.20.12.246:7100]) id (fad4f3b477364900a15679cd954bf6b5): recvmsg error: Connection refused (system error 111)
What do I need to adjust to start the master service normally?
Does this localhost master service have information about the previous node setup and try to contact that master node?
I don't think we support changing the bind/broadcast address after the fact. We persist address information in various structures across servers, e.g. for each tablet, we store the IPs of the peers in its Raft group.
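As a quick illustration of what the error above means, you can probe the master RPC port on both addresses. This is purely a diagnostic sketch (port 7100 and 10.20.12.246 are taken from the error message), not a supported way to reconfigure the master:

```ts
// probe-master.ts — check which addresses the yb-master RPC port answers on.
// 127.0.0.1 is the new bind address; 10.20.12.246:7100 is the peer address
// persisted from the original setup (copied from the error message above).
import * as net from "net";

function probe(host: string, port: number): Promise<string> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port, timeout: 3000 });
    socket.on("connect", () => { socket.destroy(); resolve(`${host}:${port} reachable`); });
    socket.on("timeout", () => { socket.destroy(); resolve(`${host}:${port} timed out`); });
    socket.on("error", (err) => resolve(`${host}:${port} -> ${err.message}`));
  });
}

async function main() {
  // After rebinding to localhost, the first probe succeeds while the second is
  // refused — and the second is the address still stored in the master metadata.
  console.log(await probe("127.0.0.1", 7100));
  console.log(await probe("10.20.12.246", 7100));
}

main();
```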
Related
MarkLogic 9.0.9
Deployed in Azure with Managed Disk
While setting up a new MarkLogic cluster, we are facing an issue with 2 server nodes, as below:
This host is down. The following error occurred while trying to contact it:
XDMP-HOSTOFFLINE: Host is offline or not responding
Host <HostName>
Online Disconnected
While looking at the error log, I found this line:
2020-05-06 05:22:28.832 Warning: A valid hostname is required for proper functioning of MarkLogic Server: SVC-SOCHN: Socket hostname error: getaddrinfo .reddog.microsoft.com: Name or service not known (whereas it should connect to )
I found a knowledge base article published in April 2020:
https://help.marklogic.com/Knowledgebase/Article/View/svc-sochn-warning-during-start-up-on-aws
Based on this article, I do not find any of the files it mentions under the /etc/ or /var/local folders.
I am not sure whether this is the reason, but I am not able to open the MarkLogic Admin Interface (port 8001).
It seems that this name is stored somewhere in the MarkLogic configuration, but where exactly is the question.
Please find below the hosts screen from the MarkLogic interface. In this case, the disconnected status is for nodes 01 and 03,
whereas I can access the Admin Interface of 01, so I am puzzled.
After discussing the same issue with the infra team, they found a problem with DNS resolution: the full DNS name was not set as the hostname within MarkLogic.
i.e. the hostname was set to ml-01 instead of ml-01.abc.com, and since MarkLogic was running in Azure, it automatically appended the default suffix, giving ml-01.reddog.microsoft.com.
So outside MarkLogic we were able to ping the server with its full name.
After correcting the DNS resolution, I was able to add the MarkLogic server nodes to the cluster.
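A quick way to verify that kind of misconfiguration from the node itself is to check what the short hostname and the expected FQDN actually resolve to. This is only a sketch; ml-01.abc.com is the example name from this thread, so substitute your own FQDN:

```ts
// check-fqdn.ts — sanity check that the machine's hostname resolves to the
// full DNS name MarkLogic expects. "ml-01.abc.com" is only the example FQDN
// used in this thread; replace it with your own.
import * as os from "os";
import { promises as dns } from "dns";

async function main() {
  const shortName = os.hostname();           // e.g. "ml-01"
  const expectedFqdn = "ml-01.abc.com";      // assumption: your real FQDN

  // If only the short name resolves, the resolver is likely appending a search
  // suffix (on Azure, reddog.microsoft.com), which is what this thread hit.
  for (const name of [shortName, expectedFqdn]) {
    try {
      const { address } = await dns.lookup(name);
      console.log(`${name} -> ${address}`);
    } catch (err) {
      console.log(`${name} does not resolve: ${(err as Error).message}`);
    }
  }
}

main();
```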
We're using Kubernetes 1.9.3 managed via kops 1.9.3 in AWS, with gossip-based DNS and the Weave CNI network plugin.
I was doing a rolling update of the master instance groups to enable some additional admission controllers (PodNodeSelector and PodTolerationRestriction). I did this in two other clusters with no problems. When the rolling update reached the third master (we run our clusters in a 3-master setup), it brought down the instance and tried to bring up a new master instance, but the new master instance failed to join the cluster. On further investigation, and after subsequent attempts to roll the third master to bring it into the cluster, I found that the failing master keeps trying to join the cluster under the old master's IP address, even though its own IP address is different. Watching kubectl get nodes | grep master shows that the cluster still thinks it has the old IP address, and the join fails because the node no longer has that IP. It seems that, for some reason, the cluster's gossip-based DNS is not being notified of the new master's IP address.
This is causing problems because the kubernetes service still has the old master's IP address in it, so any API requests directed to that non-existent backend master fail. It is also causing problems for etcd, which keeps trying to contact the member on the old IP address. Lots of logs like this:
2018-10-29 22:25:43.326966 W | etcdserver: failed to reach the peerURL(http://etcd-events-f.internal.kops-prod.k8s.local:2381) of member 3b7c45b923efd852 (Get http://etcd-events-f.internal.kops-prod.k8s.local:2381/version: dial tcp 10.34.6.51:2381: i/o timeout)
2018-10-29 22:25:43.327088 W | etcdserver: cannot get the version of member 3b7c45b923efd852 (Get http://etcd-events-f.internal.kops-prod.k8s.local:2381/version: dial tcp 10.34.6.51:2381: i/o timeout)
One odd thing: if I run etcdctl cluster-health against the available masters' etcd instances, they all report the unhealthy member ID as f90faf39a4c5d077, but the etcd-events logs show the unhealthy member ID as 3b7c45b923efd852. So there seems to be some inconsistency within etcd.
Since we are running a three-master setup with one master down, we don't want to restart any of the other masters to try to fix the problem, because we're afraid of losing quorum on the etcd cluster.
We use weave 2.3.0 as our network CNI provider.
I noticed on the failing master that the Weave CNI config /etc/cni/net.d/10-weave.conf isn't being created, and that the /etc/hosts files on the working masters aren't being updated with the new master's IP address. It seems like kube-proxy isn't getting the update for some reason.
We're running the default Debian 8 (Jessie) image that is provided with kops 1.9.
How can we get the master to properly update DNS with its new IP address?
My co-worker found that the fix was restarting the kube-dns and kube-dns-autoscaler pods. We're still not sure why they were failing to update DNS with the new master IP, but after restarting them, adding the new master to the cluster worked fine.
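For anyone who wants to do the same programmatically rather than with kubectl, here is a rough sketch using the @kubernetes/client-node library. It assumes the 0.x positional API, a kubeconfig with access to kube-system, and that simply deleting the pods is acceptable because their Deployments recreate them:

```ts
// restart-kube-dns.ts — delete the kube-dns / kube-dns-autoscaler pods so
// their Deployments recreate them (a programmatic version of the manual
// restart described above). Sketch only; verify against your client version.
import * as k8s from "@kubernetes/client-node";

async function main() {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const core = kc.makeApiClient(k8s.CoreV1Api);

  const pods = await core.listNamespacedPod("kube-system");
  for (const pod of pods.body.items) {
    const name = pod.metadata?.name ?? "";
    // Matches both kube-dns-* and kube-dns-autoscaler-* pods.
    if (name.startsWith("kube-dns")) {
      console.log(`deleting ${name}`);
      await core.deleteNamespacedPod(name, "kube-system");
    }
  }
}

main().catch((err) => { console.error(err); process.exit(1); });
```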
We are using the latest Apache Cassandra database server and the DataStax Node.js client, running in the cloud.
When our Cassandra servers are rebuilt, they get new IP addresses, and any running service clients can no longer find them; the client driver apparently caches the IP addresses instead of resolving them via DNS.
Is there some way around this problem, other than shutting down the client and creating a new one in our services whenever we encounter an error accessing the database?
If you only have 1 server, there is nothing you can do.
Otherwise, when a node is rebuilt (a single node in a cluster of many), it advertises its new IP to the cluster and the cluster topology is updated. The system.peers table is then updated, and the driver can register for that topology event (AFAIK); see the sketch below.
But why not use private static addresses for your Cassandra nodes?
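To illustrate the driver side of that answer: the DataStax Node.js driver keeps its host list from the peers table and emits events as the topology changes, and you can pass hostnames instead of raw IPs as contact points so the initial connection survives a rebuild too. A rough sketch, with placeholder contact points and data center name:

```ts
// topology-events.ts — sketch of letting the Node.js driver follow IP changes.
// "cass-1.internal.example.com" and "dc1" are placeholders for your own
// contact points and data center.
import * as cassandra from "cassandra-driver";

const client = new cassandra.Client({
  // Hostnames are resolved at connect time, so a rebuilt node with a new IP
  // can still be found as long as DNS is kept up to date.
  contactPoints: ["cass-1.internal.example.com", "cass-2.internal.example.com"],
  localDataCenter: "dc1",
});

// The driver emits events as nodes are added, removed, or change state, so it
// should pick up a rebuilt node's new address without recreating the client.
client.on("hostAdd", (host) => console.log(`host added: ${host.address}`));
client.on("hostRemove", (host) => console.log(`host removed: ${host.address}`));
client.on("hostUp", (host) => console.log(`host up: ${host.address}`));
client.on("hostDown", (host) => console.log(`host down: ${host.address}`));

async function main() {
  await client.connect();
  console.log("connected");
}

main().catch((err) => { console.error(err); process.exit(1); });
```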
I have installed a 3-node cluster on Amazon EC2. I stopped all instances; however, after restarting them, accessing the control console on port 9443 gives me a connection refused error.
Do I need to restart the MapR services, and how?
Thanks.
Did you check the status of the webserver, as it provides access to MCS (the MapR Control System)?
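If the webserver did not come back up after the restart, it can be checked and restarted with maprcli. The sketch below just wraps those commands; node1.example.com is a placeholder, and the exact maprcli options should be verified against your MapR version:

```ts
// check-mcs.ts — sketch: list MapR services on a node and restart the
// webserver that serves MCS. "node1.example.com" is a placeholder; confirm
// the maprcli syntax against your MapR release before relying on it.
import { execSync } from "child_process";

const node = "node1.example.com";

// Show the services and their state on the node (the webserver should be running).
console.log(execSync(`maprcli service list -node ${node}`).toString());

// Restart just the webserver service if it is stopped or failed.
execSync(`maprcli node services -name webserver -action restart -nodes ${node}`, {
  stdio: "inherit",
});
```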
My objective is to access my HBase cluster on Azure with Squirrel and a Phoenix driver running on my local computer.
My HBase cluster on Azure is operational. I can see it in the Ambari dashboard and I can access it using SSH. I can start Phoenix with the sqlline.py command pointing to one of the zookeeper nodes. The !tables command returns four lines.
My HBase cluster is in an Azure VNet. From my local computer (running Windows 10) I can connect to this VNet. I can ping the IP address (10.254.x.x) of the zookeeper node successfully, but pinging the FQDN of the zookeeper node results in an error message:
"Ping request could not find host zk1-.......ax.internal.cloudapp.net.
Please check the name and try again."
When I start Squirrel on my local computer with the URL pointing to the FQDN of the zookeeper node, I get an error message: "Unexpected Error occurred attempting to open an SQL connection". The stack trace points to a java.util.concurrent.RuntimeException: "Unable to establish connection".
When I start Squirrel on my local computer with the URL pointing to the IP address of the zookeeper node, I get a different error: "Unexpected Error occurred attempting to open an SQL connection". The stack trace points to a java.util.concurrent.TimeoutException.
I suspect this has something to do with the domain name resolution problem described here [https://superuser.com/questions/966832/windows-10-dns-resolution-via-vpn-connection-not-working]. I applied the fix described by LikeARock47 on Feb 23; this did not improve the situation, however.
Does this indeed have to do with the domain name resolution issue, or is the problem somewhere else?
Is there a better solution to the Domain Name resolution issue?
A JDBC connection from Squirrel on my local Windows 10 computer has successfully been established to the HBase cluster by using the zookeeper IP address, the port, and "/hbase-unsecure":
jdbc:phoenix:10.254.x.x:2181:/hbase-unsecure
I can manage my HBase cluster with a local Squirrel now!
I'd still be interested to find out how I can get the zookeeper FQDN resolved locally.
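One way to narrow that down is to compare what the VPN client's default resolver returns with what a DNS server inside the VNet returns, since *.internal.cloudapp.net names are only known to the Azure-provided DNS within the VNet. A sketch, with the FQDN and the VNet DNS server IP as placeholders (it assumes you have a DNS server or forwarder reachable inside the VNet):

```ts
// resolve-zk.ts — compare the local system resolver with a resolver inside
// the VNet for the zookeeper FQDN. Both the FQDN and the DNS server IP are
// placeholders; substitute your own values.
import { promises as dns } from "dns";

const fqdn = "zk1-xxxxx.ax.internal.cloudapp.net"; // placeholder FQDN
const vnetDnsServer = "10.254.0.4";                // placeholder: DNS server inside the VNet

async function main() {
  try {
    // What Squirrel sees: the Windows system resolver over the VPN.
    const { address } = await dns.lookup(fqdn);
    console.log(`system resolver: ${fqdn} -> ${address}`);
  } catch (err) {
    console.log(`system resolver failed: ${(err as Error).message}`);
  }

  // Query a resolver that lives inside the VNet directly.
  const resolver = new dns.Resolver();
  resolver.setServers([vnetDnsServer]);
  try {
    const addresses = await resolver.resolve4(fqdn);
    console.log(`VNet DNS: ${fqdn} -> ${addresses.join(", ")}`);
  } catch (err) {
    console.log(`VNet DNS failed: ${(err as Error).message}`);
  }
}

main();
```

If the system lookup fails while the VNet query succeeds, the problem is purely that the VPN client isn't using the VNet's DNS, which matches the superuser thread linked above.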