I have an OpenShift cluster, and periodically when accessing logs, I get:
worker1-sass-on-prem-origin-3-10 on 10.1.176.130:53: no such host" (kube making a connection to port 53 on a node).
I also tend to see tcp: lookup postgres.myapp.svc.cluster.local on 10.1.176.136:53: no such host errors from time to time in pods. This makes me think that, when accessing internal service endpoints, pods, clients, and other Kubernetes-related services actually talk to a DNS server that is assumed to be running on the node that said pods are running on.
Update
Looking into one of my pods on a given node, I found the following in resolv.conf (I had to ssh and run docker exec to get this output, since oc exec isn't working due to this issue).
/etc/cfssl $ cat /etc/resolv.conf
nameserver 10.1.176.129
search jim-emea-test.svc.cluster.local svc.cluster.local cluster.local bds-ad.lc opssight.internal
options ndots:5
Thus, it appears that in my cluster, containers have a self-referential resolv.conf entry. This cluster was created with openshift-ansible. I'm not sure if this is infra-specific, or if it's actually a fundamental aspect of how OpenShift nodes work, but I suspect the latter, as I haven't done any major customizations to my ansible workflow from the upstream openshift-ansible recipes.
Yes, DNS on every node is normal in OpenShift.
It does appear that it's normal for an openshift-ansible deployment to deploy dnsmasq services on every node.
Details.
As an example of how this can affect things, the pull request at https://github.com/openshift/openshift-ansible/pull/8187 is instructive. In any case, if a local node's dnsmasq is acting flaky for any reason, it will prevent containers running on that node from properly resolving addresses of other containers in a cluster.
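A quick health check of the node-local dnsmasq looks something like this (a sketch; the exact unit name is an assumption and can vary by deployment):
# On the affected node: is dnsmasq up, and has it logged anything suspicious?
sudo systemctl status dnsmasq
sudo journalctl -u dnsmasq --since "1 hour ago"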
Looking deeper at the dnsmasq 'smoking gun'
After checking on an individual node, I found that there was indeed a process bound to port 53, and it is dnsmasq. Hence,
[enguser@worker0-sass-on-prem-origin-3-10 ~]$ sudo netstat -tupln | grep 53
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 675/openshift
And, dnsmasq is running locally:
[enguser@worker0-sass-on-prem-origin-3-10 ~]$ ps -ax | grep dnsmasq
4968 pts/0 S+ 0:00 grep --color=auto dnsmasq
6994 ? Ss 0:22 /usr/sbin/dnsmasq -k
[enguser@worker0-sass-on-prem-origin-3-10 ~]$ sudo ps -ax | grep dnsmasq
4976 pts/0 S+ 0:00 grep --color=auto dnsmasq
6994 ? Ss 0:22 /usr/sbin/dnsmasq -k
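To confirm whether this node-local dnsmasq actually answers cluster queries, a quick check along these lines can help (my own sketch, using the nameserver IP from the resolv.conf shown above):
# Ask the node-local resolver directly for a cluster-internal name.
dig @10.1.176.129 kubernetes.default.svc.cluster.local +short
# And compare with a name that fails for pods on this node.
dig @10.1.176.129 postgres.myapp.svc.cluster.local +short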
The final clue: resolv.conf itself is even adding the local IP address as a nameserver... and this is obviously copied into containers that start.
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local bds-ad.lc opssight.internal
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.1.176.129
The solution (in my specific case)
In my case, this was happening because the local nameserver was using an ifcfg file (you can see these files in /etc/sysconfig/network-scripts/) with the following contents:
[enguser@worker0-sass-on-prem-origin-3-10 network-scripts]$ cat ifcfg-ens192
TYPE=Ethernet
BOOTPROTO=dhcp
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens192
UUID=50936212-cb5e-41ff-bec8-45b72b014c8c
DEVICE=ens192
ONBOOT=yes
However, my internally configured virtual machines could not resolve the IPs provided to them by the PEERDNS records.
Ultimately the fix was to work with our IT department to make sure our authoritative domain for our kube clusters had access to all IP addresses in our data center.
The Generic Fix to :53 lookup errors...
If you're seeing :53 record errors coming up when you try to kubectl or oc logs / exec, then it is likely that your apiserver is not able to connect with kubelets via their IP address.
If you're seeing :53 record errors in other places, for example inside of pods, then this is because your pod, using its own local DNS, isn't able to resolve internal cluster IP addresses. This might simply be because you have an outdated controller that is looking for services that don't exist anymore, or else you have flakiness at your Kubernetes DNS implementation level.
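A quick way to check the pod-level case from anywhere in the cluster (a sketch; the dns-test pod name and the busybox:1.28 image are my own choices, not part of the original answer):
# One-shot pod that uses the cluster's DNS to resolve an internal service name.
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default.svc.cluster.local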
Related
This question appears to have been asked many times, but the answers appear to be outdated, or just don't work.
I'm on a Linux system without an RTC (a Raspberry Pi). My host runs an ntp daemon (ntpd), which checks the time online as soon as the host boots up, assuming it has internet, and sets the system clock.
The code inside my container needs to know if the host's system clock is accurate (has been updated since last boot).
On the host itself, this is very easy to do - use something like ntpdate -q 127.0.0.1. ntpdate connects to 127.0.0.1:123 over udp, and checks with the ntpd daemon if the clock is accurate (if it's been updated since last boot). This appears to be more difficult to do from within a container.
If I start up a container, and use docker container inspect NAME to see the container's IP, it shows me this:
"Gateway": "172.19.0.1",
"IPAddress": "172.19.0.6",
If I run ntpdate -q 172.19.0.1 within the container, this works. Unfortunately, 172.19.0.1 isn't a permanent IP for the host. If that subnet is already taken when the container is starting up, the subnet will change, so hardcoding this IP is a bad idea. What I need is an environment variable that always reflects the proper IP for the host.
Windows and MacOS versions of docker appear to set the host.docker.internal hostname within containers, but Linux doesn't. Some people recommend setting this in the /etc/hosts file of the host, but then you're just hardcoding the IP, which again, can change.
I run my docker container with a docker-compose.yml file, and apparently, on new versions of docker, you can do this:
extra_hosts:
- "host.docker.internal:host-gateway"
I tried this, and it works. Sort of. Inside my container, host.docker.internal resolves to 172.17.0.1, which is the IP of the docker0 interface on the host. While I can ping host.docker.internal from within the container, using ntpdate -q host.docker.internal or ntpdate -q 172.17.0.1 doesn't work.
Is there a way to make host.docker.internal resolve to the proper gateway IP of the host from within the container? In my example, 172.19.0.1.
Note: Yes, I can use code within the container to check what the container's gateway is with netstat or similar, but then I need to complicate my code, making it figure out the IP of the NTP server (the docker host). I can probably also pass the docker socket into the container, and try to get the docker host's IP through that, but that seems super hacky, and an unnecessary security issue.
The best solution I've found is to use the ip command from the iproute2 package, look for the default route and use the gateway address for it.
ip route | awk '/default/ {print $3}'
If you want it in an environment variable, you can set it in an entrypoint script with
export HOST_IP_ADDRESS=$(ip route | awk '/default/ {print $3}')
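For example, a minimal entrypoint sketch (my own illustration; it assumes iproute2 is present in the image and that entrypoint.sh is wired up as the image's ENTRYPOINT):
#!/bin/sh
# entrypoint.sh (hypothetical): export the default gateway (the docker host), then hand off to the real command.
HOST_IP_ADDRESS=$(ip route | awk '/default/ {print $3}')
export HOST_IP_ADDRESS
exec "$@"
Inside the container, ntpdate -q "$HOST_IP_ADDRESS" should then query the host's ntpd, just like ntpdate -q 172.19.0.1 did in the example above.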
I am trying to access a host that sits in another server (but on my network) from inside the pod of deployment and I am using microk8s.
The thing is that on the server where I have microk8s installed I can easily ping it by ping my-network-host.qa.local. But when I go inside the pod with microk8s kubectl exec -it pod_name -- /bin/bash and I do ping my-network-host.qa.local it says: Name or service not known.
And when I connect to a VPN on my computer to be on that network and I deploy it locally using docker-desktop kubernetes I can ping that host from within the pod. So I think the problem sits in microk8s which is not letting my pod use my network.
Is there any way to tell microk8s to use my hosts from my network?
P.S. I can ping the IP of that server from the pod, but I am not able to ping the hostname from the pod.
Based on another answer that I found on Stack Overflow, I managed to fix it.
There were 2 changes needed to make it work:
Update kubelet configuration to use resolv-conf:
sudo echo "--resolv-conf=/run/systemd/resolve/resolv.conf" >> /var/snap/microk8s/current/args/kubelet
Restart kubelet service:
sudo service snap.microk8s.daemon-kubelet restart
Then change the CoreDNS forward to point to your nameserver:
First, open the coredns config map so you can edit it:
sudo microk8s.kubectl edit configmap coredns -n kube-system
and update the forward entry:
forward . 8.8.8.8 8.8.4.4 #REMOVE THIS LINE
forward . xxx.xxx.xxx.xxx #ADD THIS WITH YOUR IP
You can get the eth0 DNS address with:
nmcli dev show 2>/dev/null | grep DNS | sed 's/^.*:\s*//'
In my case, I already had that IP as the nameserver in /run/systemd/resolve/resolv.conf.
Now just save the changes and go inside your pods; you will be able to resolve those hosts.
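As a quick verification (a sketch; pod_name and the hostname are the ones from the question):
# After the CoreDNS change, the hostname from the question should now resolve from inside the pod.
microk8s kubectl exec -it pod_name -- ping -c 3 my-network-host.qa.local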
There is a comment in another post that suggests adding
forward . /etc/resolv.conf
but that didn't work in my case.
We have set up an OKD 3.11 cluster with 100+ nodes. Everything was working fine, but then a worker node stopped resolving the registry service's internal URL. This causes new pods scheduled to that node to fail with an ImagePullBackOff error.
Failed to pull image "docker-registry.default.svc:5000/app-name/app-name:latest": rpc error: code = Unknown desc = Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 10.*.*.71:53: server misbehaving
We tried running nslookup on the worker node, and the following were the results.
This doesn't work (while it works on other nodes):
[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local
Server: 10.*.*.71
Address: 10.*.*.71#53
** server can't find docker-registry.default.svc.cluster.local: SERVFAIL
This works just fine.
[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local 127.0.0.1
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: docker-registry.default.svc.cluster.local
Address: 172.*.*.212
Adding server=/cluster.local/172.30.0.1 to the dnsmasq conf file /etc/dnsmasq.d/origin-upstream-dns.conf works as a workaround, but we can't find what is causing this.
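For reference, a sketch of that workaround (the restart command is my assumption, not part of the original report):
# Send cluster.local queries to the cluster DNS IP, then reload dnsmasq.
echo 'server=/cluster.local/172.30.0.1' | sudo tee -a /etc/dnsmasq.d/origin-upstream-dns.conf
sudo systemctl restart dnsmasq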
I have tried adding -q to the dnsmasq service's ExecStart, and it shows that dnsmasq won't query the OpenShift DNS running locally at 127.0.0.1:53.
Dnsmasq config/resolv.conf is in order on the node.
I have tried restarting dnsmasq/NetworkManager/Docker, and I have tried respawning the ovs/sdn pods, but still no help.
Found some documented evidence that dnsmasq can behave like that.
It has been suggested by some Red Hat articles that a long-running dnsmasq service may misbehave and stop resolving names. Similar cases have been reported for OpenShift environments as well.
The links below suggest that restarting the service would solve the problem for some time and then the issue may resurface. As stated earlier, in my case a service restart didn't help, but the oldest remedy in IT worked (rebooting the node solved the problem).
Reference:
https://access.redhat.com/solutions/3393141
https://bugzilla.redhat.com/show_bug.cgi?id=1560489
It appears that local-up-cluster in Kubernetes, on Ubuntu, isn't able to resolve DNS queries when relying on cluster DNS.
setup
I'm running an Ubuntu box, with environment variables for DNS set for local-up-cluster:
# env | grep KUBE
KUBE_ENABLE_CLUSTER_DNS=true
KUBE_DNS_SERVER_IP=172.17.0.1
running information
sky-dns seems happy:
I0615 00:04:13.563037 1 server.go:198] Skydns metrics enabled (/metrics:10055)
I0615 00:04:13.563051 1 dns.go:147] Starting endpointsController
I0615 00:04:13.563054 1 dns.go:150] Starting serviceController
I0615 00:04:13.563125 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0615 00:04:13.563141 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0615 00:04:13.589840 1 dns.go:264] New service: kubernetes
I0615 00:04:13.589971 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0615 00:04:14.063246 1 dns.go:171] Initialized services and endpoints from apiserver
I0615 00:04:14.063267 1 server.go:129] Setting up Healthz Handler (/readiness)
I0615 00:04:14.063274 1 server.go:134] Setting up cache handler (/cache)
I0615 00:04:14.063288 1 server.go:120] Status HTTP port 8081
kube-proxy seems happy:
I0615 00:03:53.448369 5706 proxier.go:864] Setting endpoints for "default/kubernetes:https" to [172.31.44.133:6443]
I0615 00:03:53.545124 5706 controller_utils.go:1001] Caches are synced for service config controller
I0615 00:03:53.545146 5706 config.go:210] Calling handler.OnServiceSynced()
I0615 00:03:53.545208 5706 proxier.go:979] Not syncing iptables until Services and Endpoints have been received from master
I0615 00:03:53.545125 5706 controller_utils.go:1001] Caches are synced for endpoints config controller
I0615 00:03:53.545224 5706 config.go:110] Calling handler.OnEndpointsSynced()
I0615 00:03:53.545274 5706 proxier.go:309] Adding new service port "default/kubernetes:https" at 10.0.0.1:443/TCP
I0615 00:03:53.545329 5706 proxier.go:991] Syncing iptables rules
I0615 00:03:53.993514 5706 proxier.go:991] Syncing iptables rules
I0615 00:03:54.008738 5706 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
I0615 00:04:24.008904 5706 proxier.go:991] Syncing iptables rules
I0615 00:04:24.023057 5706 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
result
However, I don't seem to be able to resolve anything inside the cluster, same result with docker exec or kube exec:
➜ kubernetes git:(master) kc exec --namespace=kube-system kube-dns-2673147055-4j6wm -- nslookup kubernetes.default.svc.cluster.local
Defaulting container name to kubedns.
Use 'kubectl describe pod/kube-dns-2673147055-4j6wm' to see all of the
containers in this pod.
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'kubernetes.default.svc.cluster.local': Name does not resolve
question
What's the simplest way to further debug a system created using local-up-cluster where the DNS pods are running, but kubernetes.default.svc.cluster.local is not resolved? Note that all other aspects of this cluster appear to be working perfectly.
System info : Linux ip-172-31-44-133 4.4.0-1018-aws #27-Ubuntu SMP Fri May 19 17:20:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.
Example of resolv.conf that is being placed in my containers...
/etc/cfssl # cat /etc/resolv.conf
nameserver 172.17.0.1
search default.svc.cluster.local svc.cluster.local cluster.local dc1.lan
options ndots:5
I can't comment on your post so I'll attempt to answer this.
First of all, certain Alpine images have trouble resolving using nslookup. DNS might in fact be working normally in your cluster.
To validate this, read the logs of the pods (e.g. traefik, heapster, calico) that communicate with kube-apiserver. If no errors are observed, what you have is probably a non-problem.
If you want to be doubly-sure, deploy a non-Alpine pod and try nslookup.
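For instance, a throwaway non-Alpine pod (the debian:stretch image and the dnsutils package are my own choices here):
# Run a one-off debian-based pod and test resolution of an internal name.
kubectl run dns-check --rm -it --image=debian:stretch --restart=Never -- bash -c "apt-get update -qq && apt-get install -y -qq dnsutils && nslookup kubernetes.default.svc.cluster.local"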
If it really is a DNS issue, I will debug in this sequence (commands are sketched after this list).
kubectl exec into the kube-dns pod. Run nslookup kubernetes.default.svc.cluster.local localhost. If this works, DNS is in fact running. If it doesn't, kube-dns should have entered a CrashLoopBackOff state by now.
kubectl exec into a deployed pod. Run nslookup kubernetes.default.svc.cluster.local <cluster-ip>. If this works, you're good to go. If it doesn't, something is up with the pod network. Without details, I can't recommend further steps.
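A sketch of those two checks (the kube-dns pod name is the one from the question; <your-pod> and <cluster-ip> are placeholders):
# 1) From inside the kube-dns pod itself, ask the local server.
kubectl exec -it --namespace=kube-system -c kubedns kube-dns-2673147055-4j6wm -- nslookup kubernetes.default.svc.cluster.local localhost
# 2) From an ordinary deployed pod, ask the cluster DNS service IP directly.
kubectl exec -it <your-pod> -- nslookup kubernetes.default.svc.cluster.local <cluster-ip>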
Good luck!
I figured I'd post a systematic answer that usually works for me. I was hoping for something more elegant, and this isn't ideal, but I think it's the best place to start.
1) Make sure your DNS nanny and your SkyDNS are running. The nanny and SkyDNS should both show in their docker logs that they've bound to a port.
2) When you create new services, make sure that SkyDNS is writing them to the logs and showing the creation of SRV and so on.
3) Look in /etc/resolv.conf in your docker containers. Make sure the nameserver looks like something on your internal docker IP addresses (i.e. 10.... in a regular docker0 config on fedora)
There are specific env variables you need to export correctly: API_HOST=true and KUBE_ENABLE_CLUSTER_DNS=true.
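A sketch of how those might be exported before launching the cluster (KUBE_DNS_SERVER_IP is taken from the question's setup, and the script path assumes a kubernetes source checkout):
# Assumed invocation; adjust the DNS IP to your environment.
export API_HOST=true
export KUBE_ENABLE_CLUSTER_DNS=true
export KUBE_DNS_SERVER_IP=172.17.0.1
./hack/local-up-cluster.sh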
There are a lot of deeper tools you can use, like route -n and so on, to debug container networking even more, but local-up-cluster should generally 'just work', and if the above steps surface something suspicious, it's worth mentioning in the kubernetes community as a possible bug.
I am trying to force the docker daemon to use my DNS server, which is bound to the bridge0 interface.
I have added --dns 172.17.42.1 to my docker_opts, but no success.
The DNS server replies OK to a dig command:
dig @172.17.42.1 registry.service.consul SRV +short
1 1 5000 registry2.node.staging.consul.
But pull with this domain fails:
docker pull registry.service.consul:5000/test
FATA[0000] Error: issecure: could not resolve "registry.service.consul": lookup registry.service.consul: no such host
PS: Adding nameserver 172.17.42.1 to my /etc/resolv.conf solves the issue, but the DNS has to be exclusively for docker commands.
Any idea?
You passed --dns 172.17.42.1 to docker_opts, so you should be able to resolve container hostnames from inside other containers. But obviously you're doing docker pull from the host, not from a container, aren't you? Therefore it's not surprising that you cannot resolve a container's hostname from your host, because the host is not configured to use 172.17.42.1 for resolving.
I see two possible solutions here:
Force your host to use 172.17.42.1 as DNS (/etc/resolv.conf etc).
Create a special container with the Docker client inside and mount docker.sock into it. This will let you use all client commands, including pull:
docker run -d -v /var/run/docker.sock:/var/run/docker.sock:rw --name=client ...
docker exec -it client docker pull registry.service.consul:5000/test