How can I root cause DNS lookup failures in a (local) kubernetes instance (where sky-dns is healthy)

It appears that local-up-cluster in kubernetes, on ubuntu, isn't able to resolve DNS queries when relying on cluster DNS.
setup
I'm running an Ubuntu box, with the environment variables for DNS set for local-up-cluster:
# env | grep KUBE
KUBE_ENABLE_CLUSTER_DNS=true
KUBE_DNS_SERVER_IP=172.17.0.1
running information
sky-dns seems happy:
I0615 00:04:13.563037 1 server.go:198] Skydns metrics enabled (/metrics:10055)
I0615 00:04:13.563051 1 dns.go:147] Starting endpointsController
I0615 00:04:13.563054 1 dns.go:150] Starting serviceController
I0615 00:04:13.563125 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0615 00:04:13.563141 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0615 00:04:13.589840 1 dns.go:264] New service: kubernetes
I0615 00:04:13.589971 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0615 00:04:14.063246 1 dns.go:171] Initialized services and endpoints from apiserver
I0615 00:04:14.063267 1 server.go:129] Setting up Healthz Handler (/readiness)
I0615 00:04:14.063274 1 server.go:134] Setting up cache handler (/cache)
I0615 00:04:14.063288 1 server.go:120] Status HTTP port 8081
kube-proxy seems happy:
I0615 00:03:53.448369 5706 proxier.go:864] Setting endpoints for "default/kubernetes:https" to [172.31.44.133:6443]
I0615 00:03:53.545124 5706 controller_utils.go:1001] Caches are synced for service config controller
I0615 00:03:53.545146 5706 config.go:210] Calling handler.OnServiceSynced()
I0615 00:03:53.545208 5706 proxier.go:979] Not syncing iptables until Services and Endpoints have been received from master
I0615 00:03:53.545125 5706 controller_utils.go:1001] Caches are synced for endpoints config controller
I0615 00:03:53.545224 5706 config.go:110] Calling handler.OnEndpointsSynced()
I0615 00:03:53.545274 5706 proxier.go:309] Adding new service port "default/kubernetes:https" at 10.0.0.1:443/TCP
I0615 00:03:53.545329 5706 proxier.go:991] Syncing iptables rules
I0615 00:03:53.993514 5706 proxier.go:991] Syncing iptables rules
I0615 00:03:54.008738 5706 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
I0615 00:04:24.008904 5706 proxier.go:991] Syncing iptables rules
I0615 00:04:24.023057 5706 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
result
However, I can't resolve anything inside the cluster; the result is the same with docker exec or kubectl exec:
➜ kubernetes git:(master) kc exec --namespace=kube-system kube-dns-2673147055-4j6wm -- nslookup kubernetes.default.svc.cluster.local
Defaulting container name to kubedns.
Use 'kubectl describe pod/kube-dns-2673147055-4j6wm' to see all of the
containers in this pod.
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'kubernetes.default.svc.cluster.local': Name does not resolve
question
What's the simplest way to further debug a system created using local-up-cluster where the DNS pods are running, but kubernetes.default.svc.cluster.local does not resolve? Note that all other aspects of this cluster appear to be working perfectly.
System info : Linux ip-172-31-44-133 4.4.0-1018-aws #27-Ubuntu SMP Fri May 19 17:20:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.
Example of resolv.conf that is being placed in my containers...
/etc/cfssl # cat /etc/resolv.conf
nameserver 172.17.0.1
search default.svc.cluster.local svc.cluster.local cluster.local dc1.lan
options ndots:5
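To reason about what actually gets sent to 172.17.0.1 with this resolv.conf, here is a minimal Python sketch of how a glibc-style resolver orders its candidate names from the search list and ndots. The function name candidate_names is hypothetical and the logic is simplified; real resolvers differ in the details.

```python
def candidate_names(name, search, ndots):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; the search list is skipped.
        return [name]
    with_search = [f"{name}.{domain}." for domain in search]
    as_is = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as typed first, then the search list.
        return [as_is] + with_search
    # Too few dots: walk the search list first, the name as typed comes last.
    return with_search + [as_is]


# Values taken from the resolv.conf shown in the question.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local", "dc1.lan"]
for candidate in candidate_names("kubernetes.default.svc.cluster.local", search, ndots=5):
    print(candidate)
```

Since kubernetes.default.svc.cluster.local has only four dots, it falls below ndots:5, so the search suffixes are tried first and the name as typed is tried last.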

I can't comment on your post so I'll attempt to answer this.
First of all, certain Alpine images have trouble resolving using nslookup. DNS might in fact be working normally in your cluster.
To validate this, read the logs of pods that communicate with kube-apiserver (e.g. traefik, heapster, calico). If no errors are observed, what you have is probably a non-problem.
If you want to be doubly sure, deploy a non-Alpine pod and try nslookup from there.
If it really is a DNS issue, I would debug in this sequence:
1) kubectl exec into the kube-dns pod. Run nslookup kubernetes.default.svc.cluster.local localhost. If this works, DNS is in fact running. If it doesn't, kube-dns should have entered a CrashLoopBackOff state by now.
2) kubectl exec into a deployed pod. Run nslookup kubernetes.default.svc.cluster.local <cluster-ip>. If this works, you're good to go. If it doesn't, something is up with the pod network. Without details, I can't recommend further steps.
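If the nslookup binary in the image is itself suspect, the same "ask one specific server" check can be sketched with only the Python standard library. This is a rough sketch, not a full DNS client: the helper names build_query, rcode and query are made up for illustration, it only sends a single A query over UDP, and it only reads the response code rather than parsing the answers.

```python
import socket
import struct

# A few well-known DNS response codes.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name, txid=0x1234):
    """Build a minimal DNS query packet for an A record."""
    # Header: id, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte terminates the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def rcode(response):
    """Extract the response code from the low 4 bits of the second flags byte."""
    return response[3] & 0x0F


def query(server, name, port=53, timeout=2.0):
    """Send the query to one specific server and return the response code name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        data, _ = sock.recvfrom(512)
    code = rcode(data)
    return RCODES.get(code, str(code))
```

Something like query("127.0.0.1", "kubernetes.default.svc.cluster.local") inside the kube-dns pod would correspond to step 1, and the same call with the DNS service's cluster IP from another pod to step 2. A socket timeout points at the network path rather than at the DNS server's data.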
Bonne chance!

I figured I'd post a systematic answer that usually works for me. I was hoping for something more elegant, and this isn't ideal, but I think it's the best place to start.
1) Make sure your DNS nanny and your SkyDNS are running. Both should show in their docker logs that they've bound to a port.
2) When you create new services, make sure that SkyDNS writes them to the logs and shows the creation of SRV records and so on.
3) Look in /etc/resolv.conf in your docker containers. Make sure the nameserver looks like something on your internal docker IP addresses (i.e. 10.... in a regular docker0 config on fedora).
There are specific env variables you need to export correctly: API_HOST=true and KUBE_ENABLE_CLUSTER_DNS=true.
There are a lot of deeper tools you can use, like route -n, to debug container networking even more, but local-up-cluster should generally 'just work', and if the above steps surface something suspicious, it's worth mentioning in the Kubernetes community as a possible bug.
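Step 3 can be scripted once you have the contents of a container's /etc/resolv.conf. The sketch below is only illustrative: the helper names are hypothetical, and the 172.17.0.0/16 range is an assumption (Docker's usual default docker0 subnet), so substitute whatever your bridge actually uses.

```python
import ipaddress


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in a resolv.conf, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def outside_network(servers, cidr):
    """Return the nameservers that do NOT fall inside the expected network."""
    network = ipaddress.ip_network(cidr)
    return [s for s in servers if ipaddress.ip_address(s) not in network]


# Contents copied from the resolv.conf in the question.
sample = """\
nameserver 172.17.0.1
search default.svc.cluster.local svc.cluster.local cluster.local dc1.lan
options ndots:5
"""

# Assumed docker0 subnet; an empty list means every nameserver is on the bridge.
print(outside_network(nameservers(sample), "172.17.0.0/16"))
```

Anything this prints is a nameserver the container would reach outside the docker bridge, which is worth a second look in a local-up-cluster setup.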

Related

How to access hosts in my network from microk8s deployment pods

I am trying to access a host that sits on another server (but on my network) from inside a pod of a deployment, and I am using microk8s.
The thing is that on the server where I have microk8s installed I can easily ping it with ping my-network-host.qa.local. But when I go inside the pod with microk8s kubectl exec -it pod_name -- /bin/bash and run ping my-network-host.qa.local, it says: Name or service not known.
And when I connect to a VPN on my computer to be on that network and deploy locally using docker-desktop kubernetes, I can ping that host from within the pod. So I think the problem sits in microk8s, which is not letting my pod use my network.
Is there any way to tell microk8s to use the hosts from my network?
P.S. I can ping the IP of that server from the pod, but I am not able to ping the hostname from the pod.
Based on another answer that I found on StackOverflow, I managed to fix it.
There were 2 changes needed to make it work:
Update kubelet configuration to use resolv-conf:
echo "--resolv-conf=/run/systemd/resolve/resolv.conf" | sudo tee -a /var/snap/microk8s/current/args/kubelet
Restart kubelet service:
sudo service snap.microk8s.daemon-kubelet restart
Then change the CoreDNS forward to point to your nameserver:
First open the coredns config map so you can edit it:
sudo microk8s.kubectl edit configmap coredns -n kube-system
and update the forward line:
forward . 8.8.8.8 8.8.4.4 #REMOVE THIS LINE
forward . xxx.xxx.xxx.xxx #ADD THIS WITH YOUR IP
You can get the eth0 DNS address with:
nmcli dev show 2>/dev/null | grep DNS | sed 's/^.*:\s*//'
In my case I already had the IP as a nameserver in /run/systemd/resolve/resolv.conf.
Now just save the changes; from inside your pods you should then be able to reach the hosts on your network.
There is a comment in another post that suggests adding
forward . /etc/resolv.conf
But that didn't work in my case.
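For anyone who wants to script the Corefile change instead of editing it by hand, here is a small Python sketch of the text substitution involved. It is only a sketch: set_forward is a made-up helper, it assumes the single-line forward . ... form shown above (no braces block after it), and 10.0.0.2 is a placeholder upstream, not a real recommendation. The result still has to be written back into the coredns config map.

```python
import re


def set_forward(corefile, upstreams):
    """Replace the upstream list on every single-line `forward .` directive."""
    replacement = "forward . " + " ".join(upstreams)
    # Keep the original indentation (group 1) and swap the rest of the line.
    return re.sub(
        r"^([ \t]*)forward[ \t]+\.[ \t]+.*$",
        lambda match: match.group(1) + replacement,
        corefile,
        flags=re.MULTILINE,
    )


# A cut-down Corefile for illustration; a real one has more plugins.
sample = """\
.:53 {
    errors
    forward . 8.8.8.8 8.8.4.4
    cache 30
}
"""

print(set_forward(sample, ["10.0.0.2"]))  # 10.0.0.2 is a placeholder nameserver
```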

One openshift-origin worker node won't resolve cluster.local records, causing ImagePullBackOff

We have set up an OKD 3.11 cluster with 100+ nodes. Everything was working fine, but then a worker node stopped resolving the registry service's internal URL. This causes new pods scheduled to that node to fail with an ImagePullBackOff error.
Failed to pull image "docker-registry.default.svc:5000/app-name/app-name:latest": rpc error: code = Unknown desc = Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 10.*.*.71:53: server misbehaving
We tried running nslookup on the worker node, with the following results.
This doesn't work (although it works on other nodes):
[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local
Server: 10.*.*.71
Address: 10.*.*.71#53
** server can't find docker-registry.default.svc.cluster.local: SERVFAIL
This works just fine:
[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local 127.0.0.1
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: docker-registry.default.svc.cluster.local
Address: 172.*.*.212
Adding server=/cluster.local/172.30.0.1 to the dnsmasq conf file /etc/dnsmasq.d/origin-upstream-dns.conf works as a workaround, but I can't find what is causing this.
I have tried adding -q to the dnsmasq service's ExecStart, and it shows that dnsmasq won't query the OpenShift DNS running locally at 127.0.0.1:53.
The dnsmasq config and resolv.conf are in order on the node.
I have tried restarting dnsmasq, NetworkManager, and Docker, and I have tried respawning the ovs/sdn pods, but nothing helped.
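To make the workaround's effect concrete, here is a rough Python model of how a server=/domain/ip rule steers a query: names under the listed domain go to that upstream, everything else goes to the default servers. This is a simplified illustration of dnsmasq's documented per-domain behaviour, not its implementation; the helper names are invented, and 10.0.0.71 is a stand-in for the masked upstream address in the question.

```python
def parse_server_rules(conf_text):
    """Split dnsmasq server= lines into per-domain rules and default upstreams."""
    rules = {}
    default = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line.startswith("server="):
            continue
        value = line[len("server="):]
        if value.startswith("/"):
            # Form: /domain[/domain...]/upstream
            *domains, upstream = value.split("/")[1:]
            for domain in domains:
                rules[domain.lower().strip(".")] = upstream
        else:
            default.append(value)
    return rules, default


def upstream_for(name, rules, default):
    """Pick the upstream for a name; the most specific matching domain wins."""
    labels = name.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in rules:
            return [rules[suffix]]
    return default


# The first line is the workaround from the question; the second is a placeholder.
conf = "server=/cluster.local/172.30.0.1\nserver=10.0.0.71\n"
rules, default = parse_server_rules(conf)
print(upstream_for("docker-registry.default.svc.cluster.local", rules, default))
print(upstream_for("example.com", rules, default))
```

In this model the registry name is sent to 172.30.0.1 while unrelated names keep going to the default upstream, which matches what the workaround appears to achieve.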
Found some documented evidence that dnsmasq can behave like that.
Some Red Hat articles suggest that a long-running dnsmasq service may misbehave and stop resolving names. Similar cases have been reported for OpenShift environments as well.
The links below suggest that restarting the service solves the problem for some time, after which the issue may resurface. As stated earlier, in my case a service restart didn't help, but the oldest remedy in IT worked: rebooting the node solved the problem.
Reference:
https://access.redhat.com/solutions/3393141
https://bugzilla.redhat.com/show_bug.cgi?id=1560489

Do OpenShift nodes provide a dns service by default?

I have an OpenShift cluster, and periodically when accessing logs, I get:
worker1-sass-on-prem-origin-3-10 on 10.1.176.130:53: no such host"
which looks like kube making a connection to port 53 on a node.
I also tend to see tcp: lookup postgres.myapp.svc.cluster.local on 10.1.176.136:53: no such host errors from time to time in pods. Again, this makes me think that, when accessing internal service endpoints, pods, clients, and other Kubernetes-related services actually talk to a DNS server that is assumed to be running on the node those pods are running on.
Update
Looking into one of my pods on a given node, I found the following in resolv.conf (I had to ssh and run docker exec to get this output - since oc exec isn't working due to this issue).
/etc/cfssl $ cat /etc/resolv.conf
nameserver 10.1.176.129
search jim-emea-test.svc.cluster.local svc.cluster.local cluster.local bds-ad.lc opssight.internal
options ndots:5
Thus, it appears that in my cluster, containers have a self-referential resolv.conf entry. This cluster is created with openshift-ansible. I'm not sure if this is infra-specific, or if it's actually a fundamental aspect of how OpenShift nodes work, but I suspect the latter, as I haven't made any major customizations to my ansible workflow from the upstream openshift-ansible recipes.
Yes, DNS on every node is normal in OpenShift.
It does appear that it's normal for an openshift-ansible deployment to deploy dnsmasq services on every node.
Details.
As an example of how this can affect things, https://github.com/openshift/openshift-ansible/pull/8187 is instructive. In any case, if a local node's dnsmasq is acting flaky for any reason, it will prevent containers running on that node from properly resolving addresses of other containers in the cluster.
Looking deeper at the dnsmasq 'smoking gun'
After checking on an individual node, I found that there was indeed a process bound to port 53, and it is dnsmasq. Hence,
[enguser@worker0-sass-on-prem-origin-3-10 ~]$ sudo netstat -tupln | grep 53
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 675/openshift
And, dnsmasq is running locally:
[enguser@worker0-sass-on-prem-origin-3-10 ~]$ ps -ax | grep dnsmasq
4968 pts/0 S+ 0:00 grep --color=auto dnsmasq
6994 ? Ss 0:22 /usr/sbin/dnsmasq -k
[enguser@worker0-sass-on-prem-origin-3-10 ~]$ sudo ps -ax | grep dnsmasq
4976 pts/0 S+ 0:00 grep --color=auto dnsmasq
6994 ? Ss 0:22 /usr/sbin/dnsmasq -k
The final clue: resolv.conf itself even adds the local IP address as a nameserver, and this is obviously inherited by containers that start.
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local bds-ad.lc opssight.internal
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.1.176.129
The solution (in my specific case)
In my case, this was happening because the local nameserver was using an ifcfg file (you can see these files in /etc/sysconfig/network-scripts/) with the following contents:
[enguser@worker0-sass-on-prem-origin-3-10 network-scripts]$ cat ifcfg-ens192
TYPE=Ethernet
BOOTPROTO=dhcp
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens192
UUID=50936212-cb5e-41ff-bec8-45b72b014c8c
DEVICE=ens192
ONBOOT=yes
However, my internally configured Virtual Machines could not resolve IPs provided to them by the PEERDNS records.
Ultimately the fix was to work with our IT department to make sure our authoritative domain for our kube clusters had access to all IP addresses in our data center.
The generic fix to :53 lookup errors
If you're seeing :53 record errors when you try to kubectl or oc logs / exec, then it is likely that your apiserver is not able to connect to the kubelets via their IP addresses.
If you're seeing :53 record errors in other places, for example inside pods, then this is because your pod, using its own local DNS, isn't able to resolve internal cluster IP addresses. This might simply be because you have an outdated controller that is looking for services that don't exist anymore, or else you have flakiness at your Kubernetes DNS implementation level.
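When sorting these :53 errors into the two cases above, it helps to pull out which name was looked up and which resolver was asked. A minimal Python sketch, assuming the error text has the "lookup NAME on SERVER:PORT: REASON" shape seen in the messages quoted earlier; the helper names are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "lookup postgres.myapp.svc.cluster.local on 10.1.176.136:53: no such host"
LOOKUP_ERROR = re.compile(
    r'lookup (?P<name>\S+) on (?P<server>\S+):(?P<port>\d+): (?P<reason>[^"]+)'
)


def parse_lookup_error(message):
    """Return the name, server, port and reason of a lookup error, or None."""
    match = LOOKUP_ERROR.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["reason"] = fields["reason"].strip()
    return fields


def failing_resolvers(lines):
    """Count lookup errors per resolver address across a set of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_lookup_error(line)
        if parsed:
            counts[parsed["server"]] += 1
    return counts


# Messages taken from the question.
logs = [
    'worker1-sass-on-prem-origin-3-10 on 10.1.176.130:53: no such host"',
    "tcp: lookup postgres.myapp.svc.cluster.local on 10.1.176.136:53: no such host",
    "lookup worker1-sass-on-prem-origin-3-10 on 10.1.176.130:53: no such host",
]
print(failing_resolvers(logs))
```

If the failing lookups are node hostnames, that points at the apiserver-to-kubelet case; if they are service names, that points at the in-pod case, and the per-server counts show which node's resolver to inspect.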

Service IP is not accessible across nodes in kubernetes

I have created a Kubernetes v1.2 cluster running in Azure cloud with one master (Master) and two nodes (Node1 and Node2). I have deployed an Nginx and a Tomcat application. Both containers are deployed in individual pods with an RC, and each has a Service.
The Nginx pod is deployed on Node1 and the Tomcat pod is deployed on Node2. Now Nginx on Node1 is trying to access Tomcat via Tomcat's service IP (clusterIP), which is on Node2. But it's unreachable.
Nginx serviceIP: 10.16.0.2 Node1
Tomcat serviceIP: 10.16.0.4 Node2
I tried curl 10.16.0.4:8080 from Node2, and it works. But the same from Node1 fails with curl: (52) Empty reply from server
So communication to the service IP across nodes fails. Is this a problem with kube v1.2?
Note: The ClusterIP for the Service is specified at the time of creating the service.
Since you are able to reach the cluster IP from Node2, it looks like the service selector is properly defined.
Kube-proxy is the component that watches services and creates iptables rules for endpoints. I would check whether kube-proxy is running properly on Node1. Then check whether the iptables rules are set properly for the cluster IP you are trying to reach.
You can see these with iptables -L -t nat | grep namespace/servicename
Here is an example:
bash-4.3# iptables -L -t nat | grep kube-system/heapster
KUBE-MARK-MASQ all -- 172.168.16.182 anywhere /* kube-system/heapster: */
DNAT tcp -- anywhere anywhere /* kube-system/heapster: */ tcp to:172.168.16.182:8082
KUBE-SVC-BJM46V3U5RZHCFRZ tcp -- anywhere 192.168.172.66 /* kube-system/heapster: cluster IP */ tcp dpt:http
KUBE-SEP-KNJP5BBKUOCH7NDB all -- anywhere anywhere /* kube-system/heapster: */
In this example I looked up heapster running in the kube-system namespace. It shows that the cluster IP 192.168.172.66 DNATs to the endpoint 172.168.16.182, which is the pod's IP. (You should cross-check this with the endpoints listed in kubectl describe service.)
If it is not there, restarting kube-proxy might help.
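The cross-check can be sketched in Python if you save the iptables listing to a string. This is only an illustration: dnat_targets and missing_endpoints are invented names, the parsing assumes the listing format shown above, and the endpoint list is something you would copy by hand from kubectl describe service.

```python
import re


def dnat_targets(listing, service):
    """Return the ip:port targets of DNAT rules tagged with namespace/servicename."""
    marker = f"/* {service}:"
    targets = []
    for line in listing.splitlines():
        fields = line.split()
        if fields[:1] == ["DNAT"] and marker in line:
            match = re.search(r"to:(\S+)", line)
            if match:
                targets.append(match.group(1))
    return targets


def missing_endpoints(endpoints, targets):
    """Return endpoints that have no matching DNAT rule on this node."""
    return sorted(set(endpoints) - set(targets))


# Output copied from the heapster example above.
listing = """\
KUBE-MARK-MASQ all -- 172.168.16.182 anywhere /* kube-system/heapster: */
DNAT tcp -- anywhere anywhere /* kube-system/heapster: */ tcp to:172.168.16.182:8082
KUBE-SVC-BJM46V3U5RZHCFRZ tcp -- anywhere 192.168.172.66 /* kube-system/heapster: cluster IP */ tcp dpt:http
KUBE-SEP-KNJP5BBKUOCH7NDB all -- anywhere anywhere /* kube-system/heapster: */
"""

targets = dnat_targets(listing, "kube-system/heapster")
print(missing_endpoints(["172.168.16.182:8082"], targets))
```

A non-empty result on Node1 would mean kube-proxy there has not programmed a rule for that endpoint.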

Adding nameservers to kubernetes

I'm using Kubernetes v1.0.6 on AWS that has been deployed using kube-up.sh.
Cluster is using kube-dns.
$ kubectl get svc kube-dns --namespace=kube-system
NAME LABELS SELECTOR IP(S) PORT(S)
kube-dns k8s-app=kube-dns,kubernetes.io/cluster-service=true,kubernetes.io/name=KubeDNS k8s-app=kube-dns 10.0.0.10 53/UDP
Which works fine.
$ kubectl exec busybox -- nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10 ip-10-0-0-10.eu-west-1.compute.internal
Name: kubernetes.default
Address 1: 10.0.0.1 ip-10-0-0-1.eu-west-1.compute.internal
This is the resolv.conf of a pod.
$ kubectl exec busybox -- cat /etc/resolv.conf
nameserver 10.0.0.10
nameserver 172.20.0.2
search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
Is it possible to have the containers use an additional nameserver?
I have a secondary DNS-based service discovery (on, let's say, 192.168.0.1) that I would like my Kubernetes containers to be able to use for DNS resolution.
P.S. A Kubernetes 1.1 solution would also be acceptable :)
Thank you very much in advance,
George
The DNS addon README has some details on this. Basically, the pod will inherit the resolv.conf setting of the node it is running on, so you could add your extra DNS server to the nodes' /etc/resolv.conf. The kubelet also takes a --resolv-conf argument that may provide a more explicit way for you to inject the extra DNS server. I don't see that flag documented anywhere yet, however.
In Kubernetes (probably) 1.2 we'll be moving to a model where nameservers are assumed to be fungible. There are too many resolvers that break when different nameservers serve different subsets of DNS, and there is no real specification here that we can point to.
In other words, we'll start dropping the host's nameserver records from the container's merged resolv.conf and making our own DNS server the only nameserver line. Our DNS will be able to forward requests to upstream nameservers.
I eventually managed to solve this pretty easily by configuring SkyDNS with an additional nameserver. You can just add the environment variable SKYDNS_NAMESERVERS, as defined in the SkyDNS docs, to your SkyDNS replication controller. It has minimal impact and does not depend on node changes.
env:
- name: SKYDNS_NAMESERVERS
  value: 10.0.0.254:53,10.0.64.254:53
For those using Kubernetes kube-dns, neither the -nameservers flag nor the SKYDNS_NAMESERVERS environment variable is available any longer.
Usage of /kube-dns:
--alsologtostderr log to standard error as well as files
--config-map string config-map name. If empty, then the config-map will not used. Cannot be used in conjunction with federations flag. config-map contains dynamically adjustable configuration.
--config-map-namespace string namespace for the config-map (default "kube-system")
--dns-bind-address string address on which to serve DNS requests. (default "0.0.0.0")
--dns-port int port on which to serve DNS requests. (default 53)
--domain string domain under which to create names (default "cluster.local.")
--healthz-port int port on which to serve a kube-dns HTTP readiness probe. (default 8081)
--kube-master-url string URL to reach kubernetes master. Env variables in this flag will be expanded.
--kubecfg-file string Location of kubecfg file for access to kubernetes master service; --kube-master-url overrides the URL part of this; if neither this nor --kube-master-url are provided, defaults to service account tokens
--log-backtrace-at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log-dir string If non-empty, write log files in this directory
--log-flush-frequency duration Maximum number of seconds between log flushes (default 5s)
--logtostderr log to standard error instead of files (default true)
--stderrthreshold severity logs at or above this threshold go to stderr (default 2)
-v, --v Level log level for V logs
--version version[=true] Print version information and quit
--vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
Now, either you put your name servers in the host's resolv.conf, so DNS is inherited from the node, or you use a custom resolv.conf and pass it to the kubelet with the --resolv-conf flag.
You need to know the IP of your CoreDNS service to set it as a secondary DNS.
Run this command to get the CoreDNS IP:
kubectl -n kube-system get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 172.20.0.10 <none> 53/UDP,53/TCP 43d
metrics-server ClusterIP 172.20.232.147 <none> 443/TCP 43d
This is how I set up DNS in my deployment YAML.
I used the Google DNS IP (for clarity) and my CoreDNS IP, but you should use your VPC DNS and your CoreDNS server.
containers:
- name: nginx
  image: nginx
  ports:
  - containerPort: 8080
dnsPolicy: None
dnsConfig:
  nameservers:
  - 8.8.8.8
  - 172.20.0.10
  searches:
  - 1b.svc.cluster.local
  - svc.cluster.local
  - cluster.local
  - ec2.internal
  options:
  - name: ndots
    value: "5"
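With dnsPolicy: None the pod's resolv.conf is built from dnsConfig alone, so it can help to preview what the container will end up with. The Python sketch below is an approximation for illustration; render_resolv_conf is a made-up helper, and the exact file the kubelet writes may differ in ordering or whitespace.

```python
def render_resolv_conf(dns_config):
    """Approximate the resolv.conf a pod would get from a dnsConfig mapping."""
    lines = [f"nameserver {ns}" for ns in dns_config.get("nameservers", [])]
    searches = dns_config.get("searches", [])
    if searches:
        lines.append("search " + " ".join(searches))
    options = []
    for option in dns_config.get("options", []):
        # Options with a value render as name:value, bare options as just the name.
        if option.get("value"):
            options.append(f'{option["name"]}:{option["value"]}')
        else:
            options.append(option["name"])
    if options:
        lines.append("options " + " ".join(options))
    return "\n".join(lines) + "\n"


# The dnsConfig from the deployment above, as a Python dict.
dns_config = {
    "nameservers": ["8.8.8.8", "172.20.0.10"],
    "searches": ["1b.svc.cluster.local", "svc.cluster.local", "cluster.local", "ec2.internal"],
    "options": [{"name": "ndots", "value": "5"}],
}
print(render_resolv_conf(dns_config), end="")
```

You can compare the printed text with kubectl exec <pod> -- cat /etc/resolv.conf to confirm the settings were applied.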
