One openshift-origin worker node won't resolve cluster.local records, causing ImagePullBackOff - linux

We have set up an OKD 3.11 cluster with 100+ nodes. Everything was working fine, but then a worker node stopped resolving the registry service's internal URL. This causes new pods scheduled to that node to fail with an ImagePullBackOff error.
Failed to pull image "docker-registry.default.svc:5000/app-name/app-name:latest": rpc error: code = Unknown desc = Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 10.*.*.71:53: server misbehaving
We tried running nslookup on the worker node, with the following results.
This doesn't work (while it works on other nodes):
[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local
Server: 10.*.*.71
Address: 10.*.*.71#53
** server can't find docker-registry.default.svc.cluster.local: SERVFAIL
This works just fine:
[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local 127.0.0.1
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: docker-registry.default.svc.cluster.local
Address: 172.*.*.212
Adding server=/cluster.local/172.30.0.1 to the dnsmasq conf file /etc/dnsmasq.d/origin-upstream-dns.conf works as a workaround, but I can't find what is causing this.
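For reference, this is roughly what the workaround looks like in context (172.30.0.1 is the default cluster DNS service IP in OKD 3.11; adjust it to your cluster):

# /etc/dnsmasq.d/origin-upstream-dns.conf
# Forward all cluster.local lookups to the cluster DNS service
server=/cluster.local/172.30.0.1

Restart dnsmasq afterwards (systemctl restart dnsmasq) so it picks up the change.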
I have tried adding -q to the dnsmasq service's ExecStart, and the query logging shows that dnsmasq is not forwarding the cluster.local queries to the OpenShift DNS running locally at 127.0.0.1:53.
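For anyone reproducing that debugging step, a sketch of enabling query logging via a systemd drop-in (run systemctl edit dnsmasq; the base ExecStart shown is the CentOS 7 default and may differ on your system):

# /etc/systemd/system/dnsmasq.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/sbin/dnsmasq -k -q

Then systemctl daemon-reload && systemctl restart dnsmasq, and watch the logged queries with journalctl -u dnsmasq -f.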
Dnsmasq config/resolv.conf is in order on the node.
I have tried restarting dnsmasq/NetworkManager/Docker, and I have tried respawning the ovs/sdn pods, but still no help.

Found some documented evidence that dnsmasq can behave like that.
Some Red Hat articles suggest that a long-running dnsmasq service may misbehave and stop resolving names. Similar cases have been reported for OpenShift environments as well.
The links below suggest that restarting the service solves the problem for some time before the issue resurfaces. As stated earlier, in my case a service restart didn't help, but the oldest remedy in IT worked: rebooting the node solved the problem.
Reference:
https://access.redhat.com/solutions/3393141
https://bugzilla.redhat.com/show_bug.cgi?id=1560489

Related

Pods can't resolve external DNS

I have a problem with k8s hosted on my own bare-metal infrastructure.
k8s was installed via kubeadm init without special configuration, and then I applied the CNI plugin.
Everything works perfectly except external DNS resolution from a Pod to the external world (internet).
For example:
I have a Pod with the name foo. If I invoke the command curl google.com I receive the error
curl: (6) Could not resolve host: google.com
but if I invoke the same command on the same pod a second time, I properly receive HTML:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved here.
</BODY></HTML>
If I repeat the command again, I can receive either the DNS resolution error or the HTML, and so on.
The behavior is random; sometimes I get the error 10 times in a row and receive the HTML on the 11th try.
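A quick way to quantify the flakiness is a test loop like the sketch below (hypothetical; it assumes a pod named foo in the current namespace with curl available):

for i in $(seq 1 20); do
  kubectl exec foo -- curl -s -o /dev/null -w "%{http_code}\n" google.com || echo "DNS failure"
done

Roughly random failures like this can indicate that one of several upstream resolvers is broken while the others work.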
I also tried to debug this error with this guide, but it did not help.
Additional information:
CoreDNS is up and running and has the default config:
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  name: coredns
and the file /etc/resolv.conf looks fine:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
The problem exists on CentOS 8 (master, kubeadm init) and on Debian 10 (node, kubeadm join).
SELinux is on permissive and swap is disabled.
It looks like the problem appears even on the host machine after installing k8s and weavenet.
I'm not certain where the problem comes from, k8s or Linux.
It started after I installed k8s.
What have I missed?
I can suggest using a different CNI plugin and setting it up from scratch. Remember, when using kubeadm, apply the CNI plugin after you have run kubeadm init, then add worker nodes; see the sketch after this paragraph. Here you can find supported CNI plugins. If the problem still exists, it's probably within your OS.
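A minimal sketch of that order, assuming Calico as the replacement CNI (the manifest URL, pod CIDR, and join parameters below are examples; use the values your chosen plugin and your kubeadm init output give you):

# On the control-plane node, after cleaning up the previous install (kubeadm reset -f)
kubeadm init --pod-network-cidr=192.168.0.0/16
export KUBECONFIG=/etc/kubernetes/admin.conf

# Apply the CNI plugin before joining any workers
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# Then, on each worker, run the join command printed by kubeadm init
kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash <hash>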
Check /etc/resolv.conf on the host. You can set the nameserver there to 8.8.8.8, as in the example below.
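A minimal sketch (8.8.8.8 is Google's public resolver; note that on many distributions /etc/resolv.conf is generated, so make the change through your network configuration tooling where applicable):

# /etc/resolv.conf on the host
nameserver 8.8.8.8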

Rabbit MQ changing hostname while preserving rabbitMQ artifacts and messages

This question is regarding rabbitmq config.
I hope this question is appropriate for the stackoverflow forum.
Please point me to the right forum if it isn't.
My problem statement is that I need to change the hostname of a linux server from "thishost" to "thathost".
The host "thishost" has RabbitMQ installed on it with a ton of artifacts and messages
I need to be able to preserve all the RabbitMQ artifacts such as queues, exchanges and also messages when the hostname changes to "thathost"
I am considering configuration change to enable rabbitmq see old hostname (thishost) despite the name change for linux
To ensure that rabbitmq hostname remains same I peg it to the original hostname by configuring following two parameters in the rabbitmq configuration file
/etc/rabbitmq/rabbitmq-env.conf
...
HOSTNAME=thishost
NODENAME=rabbit@thishost
Having made this change in the rabbitmq config, I changed the linux hostname to "thathost" and tried to start the rabbitmq service.
The rabbitmq service now refuses to start, and the error messages are as follows:
service rabbitmq-server start
Job for rabbitmq-server.service failed because the control process exited with error code.
See "systemctl status rabbitmq-server.service" and "journalctl -xe" for details.
journalctl -xe
Nov 30 11:20:07 ubuntula1 systemd[1]: Failed to start RabbitMQ Messaging Server.
Nov 30 11:20:18 ubuntula1 systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.
The logfile under /var/log/rabbitmq shows the following error:
ERROR: epmd error for host thishost: nxdomain (non-existing domain)
Any thoughts on:
how to fix the rabbitmq config
any alternative way of making rabbitmq agnostic to the hostname
whether there is a better idea for preserving the rabbitmq artifacts across hostnames
Please note I tried the following (see the sketch after this list):
exporting and importing artifacts using rabbitmqctl export_definitions/import_definitions
storing and loading messages using rabbitio
However, as I mentioned, I have a ton of artifacts and messages, and the rigor involved in that approach makes it error prone, so I am searching for a less rigorous approach.
Thanks much folks
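For completeness, a sketch of the definitions export/import I attempted (export_definitions/import_definitions are available in recent RabbitMQ versions; note that definitions cover queues, exchanges, bindings, users, and so on, but not the messages themselves, which is why rabbitio was needed as well):

# On the old host
rabbitmqctl export_definitions /tmp/definitions.json

# On the renamed host, once rabbitmq is running again
rabbitmqctl import_definitions /tmp/definitions.json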
Going by the error message in the logfile, "epmd error for host thishost: nxdomain (non-existing domain)",
I stumbled upon this post: How to resolve ERROR: epmd error for host nxdomain (non-existing domain)?
While it is not directly relevant, it provides the tip that an /etc/hosts entry is needed to map the old hostname to the same ip address.
With an alias for the old hostname added in /etc/hosts, my problem was solved :-)
So to sum it up, if you want to change the hostname of your linux host, you need to do two things to keep your artifacts from becoming unusable after the hostname change:
Change the rabbitmq configuration as already described:
/etc/rabbitmq/rabbitmq-env.conf
...
HOSTNAME=thishost
Make an alias in /etc/hosts mapping the ip address to the old hostname in addition to the new hostname, as follows:
/etc/hosts
...
a.b.c.d thathost thishost
That solved my problem; rabbitmq starts fine, with all existing artifacts intact, after the hostname change.

RabbitMQ won't cluster (nxdomain)

I want to set up 2 rabbitmq servers to work in a cluster.
When trying to run
rabbitmqctl join_cluster rabbit@my_rabbit_2.my.domain.name on my_rabbit_1,
I get: unable to connect to epmd (port 4369) on my_rabbit_2.my.domain.name: nxdomain (non-existing domain)
I use rabbitmq:latest (debian), .erlang.cookie is the same, and hosts resolve fine: I can ping in both directions, and nmap -6 -p 4369 my_rabbit_2.my.domain.name returns 4369/tcp open epmd
EDIT:
tcpdump shows that while resolving the hostname, rabbit or epmd does not perform both types of DNS query (AAAA for the IPv6 address and A for the IPv4 address), but only the IPv4 one, which fails repeatedly with nxdomain as there is no IPv4 address available. It does not try the AAAA query, except when running a command like rabbitmqctl -n rabbit@local.machine.domain.name: then it runs the AAAA query and succeeds. Hence the problem. How do I solve that?
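A sketch of how to observe this yourself (the interface name is an assumption; adjust to your container's network):

# Watch which DNS record types are requested while join_cluster runs
tcpdump -i eth0 -n port 53

Healthy dual-stack resolution shows both A and AAAA queries; here only A queries appear.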
Finally found a solution that worked for me. The Erlang documentation says that -proto_dist specifies a protocol for Erlang distribution, which defaults to inet_tcp (TCP over IPv4). So in an IPv6-only environment you have to set the -proto_dist inet6_tcp flag for erl.
This can be done by adding the following lines to your rabbitmq-env.conf (see RabbitMQ configuration docs):
# For rabbitmq-server
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-proto_dist inet6_tcp"
# For rabbitmqctl
RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp"
Note that rabbitmqctl and rabbitmq-server use different erl settings: I was unable to create the cluster without the RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp" setting when using rabbitmqctl join_cluster rabbit@host.in.my.domain. It should not be necessary in production mode. Also note that the RabbitMQ configuration docs advise against using this setting except for debugging.
unable to connect to epmd (port 4369) on my_rabbit_2.my.domain.name: nxdomain (non-existing domain)
This error is raised when the rabbitmq server is running on a hostname other than what you think it is running on, or when that hostname doesn't resolve to what you think it does.
Amusingly enough I had this exact same issue last night when one instance in our cluster failed, came back on a new hostname, and somehow corrupted its internal authentication store etc.
Without the exact dns entries etc. for your setup, all I can offer are the general troubleshooting steps sketched below.
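A sketch of the usual checks (hostnames are placeholders taken from the question):

# Does the name resolve, and to the address you expect?
nslookup my_rabbit_2.my.domain.name

# Is epmd reachable, and does it know about the local rabbit node?
nmap -6 -p 4369 my_rabbit_2.my.domain.name
epmd -names

# What hostname does the local node think it has?
rabbitmqctl status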
See this StackOverflow question for a resolution that may help you - in particular the answer by Kishor Pawar.
Are you sure you configured rabbitmq to listen on IPv6? Is there a reason you can't bind it to IPv4 as well, on 127.0.0.1, for management operations?

Unable to enter a kubernetes pod. Error from server: error dialing backend: dial tcp: lookup (node hostname) on 168.63.129.16:53: no such host

We have deployed a K8S cluster using ACS engine in an Azure public cloud.
We are able to create deployments and services, but when we enter a pod using "kubectl exec -ti (pod name) (command)" we receive the error below:
Error from server: error dialing backend: dial tcp: lookup (node hostname) on 168.63.129.16:53: no such host
I looked all over the internet and tried everything I could to fix this issue, but no luck so far.
The OS is Ubuntu, and 168.63.129.16 is a public IP from Azure used for DNS (refer to the link below).
https://blogs.msdn.microsoft.com/mast/2015/05/18/what-is-the-ip-address-168-63-129-16/
I've already added host entries to /etc/hosts and entries to resolv.conf on the master/node servers, and nslookup resolves them. I've also tested adding the --resolv-conf flag to the kubelet, but it still fails. I'm hoping that someone from this community can help us fix this issue.
Verify that the node on which your pod is running can be resolved and reached from inside the API server container. If you added entries to /etc/resolv.conf on the master node, verify they are visible in the API server container; if they are not, restarting the API server pod might help. A sketch of both steps follows.
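A sketch of that check (names in angle brackets are placeholders; this assumes the API server image ships a resolver utility such as nslookup, which may not hold for newer distroless images):

# Resolve the worker's hostname from inside the API server pod
kubectl -n kube-system exec kube-apiserver-<master-hostname> -- nslookup <node-hostname>

# If the entries are missing, restart the static pod by moving its manifest out and back
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/ && sleep 5 \
  && mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/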
The problem was in the VirtualBox layer:
sudo ifconfig vboxnet0 up
The solution is taken from here: https://github.com/kubernetes/minikube/issues/1224#issuecomment-316411907

Redis not reading configuration file

I am trying to set up a redis server following this guide to provide shared sessions for my Elastic Beanstalk.
I've installed redis on a new ec2 instance, and it's working fine locally. However, when I try to connect the project from my Beanstalk to my redis server, I get a "connection refused" error.
After some poking around, I found out that my redis only listens on localhost (I think?):
netstat -l
tcp 0 0 localhost:6379 *:* LISTEN
I have already put bind 0.0.0.0 in /etc/redis/6379.conf, but I suspect that redis is not reading that configuration file.
My questions:
How do I check if my redis server is actually loading the configuration file? I tried typing spam into the file and running sudo service redis_6379 restart, expecting errors, but redis starts normally.
Is there another way for me to configure redis to listen for connections from anywhere in my VPC?
Edit: Found my answer.
To find out what configuration file is loaded: redis-cli -p 6379 info server
There are two parts of the configuration file that I needed to change: first add bind 0.0.0.0, then comment out the bind 127.0.0.1 that comes after it. A sketch of the change follows.
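The config_file field in the info server output tells you which file redis actually loaded; the path below is the one from the question:

# /etc/redis/6379.conf
bind 0.0.0.0
# bind 127.0.0.1

# Restart and confirm the listen address changed
sudo service redis_6379 restart
netstat -ln | grep 6379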
On Linux Ubuntu server 20.04 LTS, I ran into a similar issue after a reboot of the EC2 server. My nodeJs app was running as the ubuntu user, and I needed to make that user's node path available. What resolved it was extending the PATH within /etc/crontab:
Run sudo nano /etc/crontab and comment out the original PATH line so you can switch back if required (mine was PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin), then replace it with:
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/home/ubuntu/.nvm/versions/node/v12.20.0/bin
After that, the error disappeared for me.
