Cilium clustermesh with Azure

I'm deploying a clustermesh using aks-engine. I have installed Cilium on two different clusters. Following the clustermesh installation guide everything looks correct: nodes are listed, the status is correct, and no errors appear in the etcd-operator log. However, I cannot access external endpoints; the example app always answers from the current cluster.
Following the troubleshooting guide, I found in the debuginfo from the agents that no external endpoints are declared. Each cluster has a master and two worker nodes. I attach the node list and status from both clusters; I can provide additional logs if required.
Any help would be appreciated.
Cluster1
kubectl -n kube-system exec -it cilium-vg8sm -- cilium node list
Name IPv4 Address Endpoint CIDR IPv6 Address Endpoint CIDR
cluster1/k8s-cilium2-29734124-0 172.18.2.5 192.168.1.0/24
cluster1/k8s-cilium2-29734124-1 172.18.2.4 10.4.0.0/16
cluster1/k8s-master-29734124-0 172.18.1.239 10.239.0.0/16
cluster2/k8s-cilium2-14610979-0 172.18.2.6 192.168.2.0/24
cluster2/k8s-cilium2-14610979-1 172.18.2.7 10.7.0.0/16
cluster2/k8s-master-14610979-0 172.18.2.239 10.239.0.0/16
kubectl -n kube-system exec -it cilium-vg8sm -- cilium status
KVStore: Ok etcd: 1/1 connected: https://cilium-etcd-client.kube-system.svc:2379 - 3.3.11
ContainerRuntime: Ok docker daemon: OK
Kubernetes: Ok 1.15 (v1.15.1) [linux/amd64]
Kubernetes APIs: ["CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
Cilium: Ok OK
NodeMonitor: Disabled
Cilium health daemon: Ok
IPv4 address pool: 10/65535 allocated from 10.4.0.0/16
Controller Status: 48/48 healthy
Proxy Status: OK, ip 10.4.0.1, port-range 10000-20000
Cluster health: 6/6 reachable (2019-08-09T10:11:22Z)
Cluster2
kubectl -n kube-system exec -it cilium-rl8gt -- cilium node list
Name IPv4 Address Endpoint CIDR IPv6 Address Endpoint CIDR
cluster1/k8s-cilium2-29734124-0 172.18.2.5 192.168.1.0/24
cluster1/k8s-cilium2-29734124-1 172.18.2.4 10.4.0.0/16
cluster1/k8s-master-29734124-0 172.18.1.239 10.239.0.0/16
cluster2/k8s-cilium2-14610979-0 172.18.2.6 192.168.2.0/24
cluster2/k8s-cilium2-14610979-1 172.18.2.7 10.7.0.0/16
cluster2/k8s-master-14610979-0 172.18.2.239 10.239.0.0/16
kubectl -n kube-system exec -it cilium-rl8gt -- cilium status
KVStore: Ok etcd: 1/1 connected: https://cilium-etcd-client.kube-system.svc:2379 - 3.3.11
ContainerRuntime: Ok docker daemon: OK
Kubernetes: Ok 1.15 (v1.15.1) [linux/amd64]
Kubernetes APIs: ["CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
Cilium: Ok OK
NodeMonitor: Disabled
Cilium health daemon: Ok
IPv4 address pool: 10/65535 allocated from 10.7.0.0/16
Controller Status: 48/48 healthy
Proxy Status: OK, ip 10.7.0.1, port-range 10000-20000
Cluster health: 6/6 reachable (2019-08-09T10:40:39Z)

This problem is tracked in https://github.com/cilium/cilium/issues/8849; the fix will be available in version 1.6.
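For reference, once you are on a fixed version you can check that the example service is actually treated as a global service and that backends from the remote cluster show up in the agent. A rough sketch, assuming the rebel-base service from the clustermesh guide and the agent pod names shown above:
# the service should carry the annotation io.cilium/global-service: "true"
kubectl get service rebel-base -o yaml | grep global-service
# from a Cilium agent, the service should list backends from both clusters
kubectl -n kube-system exec -it cilium-vg8sm -- cilium service list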

Related

Python/Pod cannot reach the internet

I'm using Python 3 with microk8s to develop a simple web service.
The service works properly with Docker on my local development machine, but on the production machine (Ubuntu 18.04 LTS with microk8s in Azure) the pod cannot reach the internet (SMTP/web REST API) once it has started, although all internal services work.
Problem
The pod cannot ping hostnames, only IP addresses. After investigating, the pod works as expected except for external resources: nslookup looks fine, but ping by hostname fails.
bash-5.1# ping www.google.com
ping: bad address 'www.google.com'
bash-5.1# nslookup www.google.com
Server: 10.152.183.10
Address: 10.152.183.10:53
Non-authoritative answer:
Name: www.google.com
Address: 74.125.68.103
Name: www.google.com
Address: 74.125.68.106
Name: www.google.com
Address: 74.125.68.99
Name: www.google.com
Address: 74.125.68.104
Name: www.google.com
Address: 74.125.68.105
Name: www.google.com
Address: 74.125.68.147
Non-authoritative answer:
Name: www.google.com
Address: 2404:6800:4003:c02::93
Name: www.google.com
Address: 2404:6800:4003:c02::63
Name: www.google.com
Address: 2404:6800:4003:c02::67
Name: www.google.com
Address: 2404:6800:4003:c02::69
bash-5.1# ping 74.125.68.103
PING 74.125.68.103 (74.125.68.103): 56 data bytes
64 bytes from 74.125.68.103: seq=0 ttl=55 time=1.448 ms
64 bytes from 74.125.68.103: seq=1 ttl=55 time=1.482 ms
^C
--- 74.125.68.103 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.448/1.465/1.482 ms
bash-5.1# python3
>>> import socket
>>> socket.gethostname()
'projects-dep-65d7b8685f-jzmxx'
>>> socket.gethostbyname('www.google.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
socket.gaierror: [Errno -3] Try again
Environments/Settings
host $ #In Host
host $ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
host $ microk8s status
microk8s is running
high-availability: no
datastore master nodes: 127.0.0.1:19001
datastore standby nodes: none
addons:
enabled:
dashboard
dns
ha-cluster
ingress
metrics-server
registry
storage
disabled:
ambassador
cilium
fluentd
gpu
helm
helm3
host-access
istio
jaeger
keda
knative
kubeflow
linkerd
metallb
multus
portainer
prometheus
rbac
traefik
# In Pod
bash-5.1 # python3
>>> import sys
>>> print({'version':sys.version, 'version-info': sys.version_info})
{'version': '3.9.3 (default, Apr 2 2021, 21:20:32) \n[GCC 10.2.1 20201203]', 'version-info': sys.version_info(major=3, minor=9, micro=3, releaselevel='final', serial=0)}
bash-5.1 #
bash-5.1 # cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local ngqy0alqbw2elndk2awonodqmd.ix.internal.cloudapp.net
nameserver 10.152.183.10
options ndots:5
You can confirm whether your pod network namespace can connect to external and internal VNet IPs with the following commands:
kubectl --namespace=kube-system exec -it ${KUBE-DNS-POD-NAME} -c kubedns -- sh
# run ping or nslookup against the metadata endpoint
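For example, a hedged sketch using the cluster DNS IP from the resolv.conf above and Azure's well-known virtual DNS IP 168.63.129.16:
# internal resolution via the cluster DNS service
nslookup kubernetes.default.svc.cluster.local 10.152.183.10
# external resolution straight against Azure's virtual DNS, bypassing cluster DNS
nslookup bing.com 168.63.129.16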
Restarting the pod or container can fix the issue of hostnames not resolving for external addresses; alternatively, you can move the pod to a different node. Also, edit the Kubernetes DNS add-on on the master (repeat for every master) as below:
vi /etc/kubernetes/addons/kube-dns-deployment.yaml
and change the arguments for the health container as below:
"--cmd=nslookup bing.com 127.0.0.1 >/dev/null"
"--url=/healthz-dnsmasq"
"--cmd=nslookup bing.com 127.0.0.1:10053 >/dev/null"
"--url=/healthz-kubedns"
"--port=8080"
"--quiet"
You can also try restarting CoreDNS with the following command (assuming the deployment is named coredns):
kubectl -n kube-system rollout restart deployment/coredns
This will force the DNS pods to restart if the above condition occurs.
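To verify the restart and re-test resolution, a quick sketch (the busybox image and pod name here are arbitrary choices, not part of the original setup):
kubectl -n kube-system rollout status deployment/coredns
# throwaway pod for a one-off DNS lookup
kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never -- nslookup www.google.com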

Kubernetes ingress "an error on the server ("") has prevented the request from succeeding"

I have a managed Azure cluster (AKS) with an NGINX ingress controller in it.
It was working fine, but now the ingress controller has stopped working:
# kubectl -v=7 logs nginx-ingress-<pod-hash> -n nginx-ingress
GET https://<PRIVATE-IP-SVC-Kubernetes>:443/version?timeout=32s
I1205 16:59:31.791773 9 round_trippers.go:423] Request Headers:
I1205 16:59:31.791779 9 round_trippers.go:426] Accept: application/json, */*
Unexpected error discovering Kubernetes version (attempt 2): an error on the server ("") has prevented the request from succeeding
# kubectl describe svc kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: <PRIVATE-IP-SVC-Kubernetes>
Port: https 443/TCP
TargetPort: 443/TCP
Endpoints: <PUBLIC-IP-SVC-Kubernetes>:443
Session Affinity: None
Events: <none>
When I try to curl https://<PRIVATE-IP-SVC-Kubernetes>:443/version?timeout=32s, I always see the same output:
curl: (35) SSL connect error
On my OCP 4.7 (OpenShift Container Platform) cluster with 3 master and 2 worker nodes, the following error appears after kubectl and oc commands.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1-5-g76a04fc", GitCommit:"e29b355", GitTreeState:"clean", BuildDate:"2021-06-03T21:19:58Z", GoVersion:"go1.15.7", Compiler:"gc", Platform:"linux/amd64"}
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding
$ oc get nodes
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding
Also, when I tried to log in to the OCP dashboard, the following error occurred:
"error_description": "The authorization server encountered an unexpected condition that prevented it from fulfilling the request"
I restarted all of the master node machines and then the problem was solved.
I faced the same issue with a three-manager cluster that I was accessing through the UCP client bundle. I figured out that 2 out of 3 manager nodes were in NotReady state. On debugging further I found a disk-space issue on those NotReady boxes. After cleaning up a little (mainly the /var folder) and restarting Docker, those nodes came back to Ready state and I'm no longer getting this error.
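A minimal sketch of that kind of cleanup (the exact paths and prune flags are assumptions; check what is actually filling the disk first):
df -h /var                          # confirm the partition is full
sudo docker system prune -af        # reclaim unused containers and images (removes all unused images)
sudo journalctl --vacuum-size=200M  # trim old journal logs
sudo systemctl restart docker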
On Windows: edit the hosts file (vi /etc/hosts) and replace the line with:
127.0.0.1 ~/.kube/config
Worked for me!

Error: forwarding ports: error upgrading connection: error dialing backend: - Azure Kubernetes Service

We have upgraded our Kubernetes Service (AKS) cluster on Azure to the latest version, 1.12.4. After that we suddenly noticed that pods and nodes can no longer communicate with each other by private IP:
kubectl get pods -o wide -n kube-system -l component=kube-proxy
NAME READY STATUS RESTARTS AGE IP NODE
kube-proxy-bfhbw 1/1 Running 2 16h 10.0.4.4 aks-agentpool-16086733-1
kube-proxy-d7fj9 1/1 Running 2 16h 10.0.4.35 aks-agentpool-16086733-0
kube-proxy-j24th 1/1 Running 2 16h 10.0.4.97 aks-agentpool-16086733-3
kube-proxy-x7ffx 1/1 Running 2 16h 10.0.4.128 aks-agentpool-16086733-4
As you can see, the node aks-agentpool-16086733-0 has private IP 10.0.4.35. When we try to check logs of pods which are on this node, we get this error:
Get https://aks-agentpool-16086733-0:10250/containerLogs/emw-sit/nginx-sit-deploy-864b7d7588-bw966/nginx-sit?tailLines=5000&timestamps=true: dial tcp 10.0.4.35:10250: i/o timeout
We have Tiller (Helm) on this node as well, and if we try to connect to Tiller we get this error from the client PC:
shmits-imac:~ andris.shmits01$ helm version
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Error: forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.0.4.35:10250: i/o timeout
Does anybody have any idea why the pods and nodes lost connectivity over private IPs?
So, after we scaled the cluster down from 4 nodes to 2 nodes the problem disappeared, and after we scaled back up from 2 nodes to 4 everything kept working fine.
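If you want to reproduce that workaround from the CLI, something like the following should do it (resource group and cluster names are placeholders):
az aks scale --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME> --node-count 2
az aks scale --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME> --node-count 4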
The issue could be with the apiserver. Did you check the logs from the apiserver pod?
Can you run the command below inside the cluster? Do you get a 200 OK response?
curl -k -v https://10.96.0.1/version
These issues occur when nodes in a Kubernetes cluster created using kubeadm do not get internal IP addresses that match the actual node/machine IPs.
Issue: if I run the helm list command against my cluster, I get the error below:
helm list
Error: forwarding ports: error upgrading connection: unable to upgrade connection: pod does not exist
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k-master Ready master 3h10m v1.18.5 10.0.0.5 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
k-worker01 Ready <none> 179m v1.18.5 10.0.0.6 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
k-worker02 Ready <none> 167m v1.18.5 10.0.2.15 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
Please note: k-worker02 has internal IP 10.0.2.15, but I was expecting 10.0.0.7, which is the node/machine IP.
Solution:
Step 1: Connect to the host (here k-worker02) which does not have the expected IP.
Step 2: Open the file below:
sudo vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Step 3: Edit the file, appending --node-ip 10.0.0.7 to the ExecStart line, as in this snippet:
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --node-ip 10.0.0.7
Step 4: Reload the daemon and restart the kubelet service
sudo systemctl daemon-reload && sudo systemctl restart kubelet
Result:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k-master Ready master 3h36m v1.18.5 10.0.0.5 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
k-worker01 Ready <none> 3h25m v1.18.5 10.0.0.6 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
k-worker02 Ready <none> 3h13m v1.18.5 10.0.0.7 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
With the above solution, the k-worker02 node gets the expected IP (10.0.0.7) and the "forwarding ports" error no longer comes from the helm list or helm install commands.
Reference: https://networkinferno.net/trouble-with-the-kubernetes-node-ip

New AKS cluster unreachable via network (including dashboard)

Yesterday I spun up an Azure Kubernetes Service cluster running a few simple apps. Three of them have exposed public IPs that were reachable yesterday.
As of this morning I can't get the dashboard tunnel to work, nor can I reach the LoadBalancer IPs themselves.
I was asked by the Azure Twitter account to solicit help here.
I don't know how to troubleshoot this apparent network issue; only az seems to be able to reach my cluster.
dashboard error log
❯❯❯ make dashboard ~/c/azure-k8s (master)
az aks browse --resource-group=akc-rg-cf --name=akc-237
Merged "akc-237" as current context in /var/folders/9r/wx8xx8ls43l8w8b14f6fns8w0000gn/T/tmppst_atlw
Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out
service+pod listing
❯❯❯ kubectl get services,pods ~/c/azure-k8s (master)
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
azure-vote-back ClusterIP 10.0.125.49 <none> 6379/TCP 16h
azure-vote-front LoadBalancer 10.0.185.4 40.71.248.106 80:31211/TCP 16h
hubot LoadBalancer 10.0.20.218 40.121.215.233 80:31445/TCP 26m
kubernetes ClusterIP 10.0.0.1 <none> 443/TCP 19h
mti411-web LoadBalancer 10.0.162.209 52.168.123.30 80:30874/TCP 26m
NAME READY STATUS RESTARTS AGE
azure-vote-back-7556ff9578-sjjn5 1/1 Running 0 2h
azure-vote-front-5b8878fdcd-9lpzx 1/1 Running 0 16h
hubot-74f659b6b8-wctdz 1/1 Running 0 9s
mti411-web-6cc87d46c-g255d 1/1 Running 0 26m
mti411-web-6cc87d46c-lhjzp 1/1 Running 0 26m
http failures
❯❯❯ curl --connect-timeout 2 -I http://40.121.215.233 ~/c/azure-k8s (master)
curl: (28) Connection timed out after 2005 milliseconds
❯❯❯ curl --connect-timeout 2 -I http://52.168.123.30 ~/c/azure-k8s (master)
curl: (28) Connection timed out after 2001 milliseconds
If you are getting getsockopt: connection timed out while trying to access your AKS dashboard, deleting the tunnelfront pod should help: once you delete it, the master will trigger creation of a new tunnelfront. It's something I have tried and it worked for me.
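Something along these lines (verify the pod name first; the tunnelfront pod lives in kube-system and is recreated automatically):
kubectl -n kube-system get pods | grep tunnelfront
kubectl -n kube-system delete pod <tunnelfront-pod-name>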
@daniel Did rebooting the agent VMs solve your issue, or are you still seeing problems?

How to fix etcd cluster misconfigured error

I have two servers: pg1 (10.80.80.195) and pg2 (10.80.80.196).
etcd version:
etcd Version: 3.2.0
Git SHA: 66722b1
Go Version: go1.8.3
Go OS/Arch: linux/amd64
I'm trying to run etcd like this:
pg1 server :
etcd --name infra0 --initial-advertise-peer-urls http://10.80.80.195:2380 --listen-peer-urls http://10.80.80.195:2380 --listen-client-urls http://10.80.80.195:2379,http://127.0.0.1:2379 --advertise-client-urls http://10.80.80.195:2379 --initial-cluster-token etcd-cluster-1 --initial-cluster infra0=http://10.80.80.195:2380,infra1=http://10.80.80.196:2380 --initial-cluster-state new
pg2 server :
etcd --name infra1 --initial-advertise-peer-urls http://10.80.80.196:2380 --listen-peer-urls http://10.80.80.196:2380 --listen-client-urls http://10.80.80.196:2379,http://127.0.0.1:2379 --advertise-client-urls http://10.80.80.196:2379 --initial-cluster-token etcd-cluster-1 --initial-cluster infra0=http://10.80.80.195:2380,infra1=http://10.80.80.196:2380 --initial-cluster-state new
When trying to check the health state on pg1:
etcdctl cluster-health
I get this error:
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://127.0.0.1:2379 exceeded header timeout
; error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #0: client: endpoint http://127.0.0.1:2379 exceeded header timeout
error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused
What am I doing wrong and how do I fix it?
Both servers run on virtual machines with a bridged adapter.
I got a similar error when I set up an etcd cluster using systemd, following the official tutorial from Kubernetes.
It was three CentOS 7 medium instances on AWS. I'm pretty sure the security groups were correct. I just ran:
$ systemctl restart network
and then
$ etcdctl cluster-health
gave a healthy result.
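If cluster-health still times out against the loopback endpoints, you can also point etcdctl at the advertised client URLs explicitly; a sketch using the addresses from the question:
etcdctl --endpoints http://10.80.80.195:2379,http://10.80.80.196:2379 cluster-health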
