GitLab Auto DevOps: Resetting a Kubernetes cluster

I'm currently on a self-hosted GitLab 11.9 instance. I'm able to add a Kubernetes cluster to projects at the individual level, but not at the group level (that was introduced in 11.10).
I created a Kubernetes cluster on AWS EKS and successfully connected it to GitLab's Auto DevOps for a specific project. I was able to install Helm Tiller, Prometheus, and GitLab Runner, and Auto DevOps worked fine for that project.
Before I discovered that group-level clusters were only introduced in GitLab 11.10, I disconnected the cluster from the first project and connected it at the group level. I successfully installed Helm Tiller but failed to install Ingress or Cert-Manager. After I discovered that my version doesn't support group-level Auto DevOps, I connected the cluster to another, different application and attempted to install Prometheus and GitLab Runner. However, the operation failed.
My pods are as follows:
% kubectl get pods --namespace=gitlab-managed-apps
NAME READY STATUS RESTARTS AGE
install-prometheus 0/1 Error 0 18h
install-runner 0/1 Error 0 18h
prometheus-kube-state-metrics-8668948654-8p4d5 1/1 Running 0 18h
prometheus-prometheus-server-746bb67956-789ln 2/2 Running 0 18h
runner-gitlab-runner-548ddfd4f4-k5r8s 1/1 Running 0 18h
tiller-deploy-6586b57bcb-p8kdm 1/1 Running 0 18h
Here's some output from my log file:
% kubectl logs install-prometheus --namespace=gitlab-managed-apps --container=helm
+ helm init --upgrade
Creating /root/.helm
Creating /root/.helm/repository
Creating /root/.helm/repository/cache
Creating /root/.helm/repository/local
Creating /root/.helm/plugins
Creating /root/.helm/starters
Creating /root/.helm/cache/archive
Creating /root/.helm/repository/repositories.yaml
Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /root/.helm.
Tiller (the Helm server-side component) has been upgraded to the current version.
Happy Helming!
+ seq 1 30
+ helm version
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Error: cannot connect to Tiller
+ sleep 1s
Retrying (1)...
+ echo 'Retrying (1)...'
+ helm version
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Error: cannot connect to Tiller
...
+ sleep 1s
+ echo 'Retrying (30)...'
+ helm upgrade prometheus stable/prometheus --install --reset-values --tls --tls-ca-cert /data/helm/prometheus/config/ca.pem --tls-cert /data/helm/prometheus/config/cert.pem --tls-key /data/helm/prometheus/config/key.pem --version 6.7.3 --set 'rbac.create=false,rbac.enabled=false' --namespace gitlab-managed-apps -f /data/helm/prometheus/config/values.yaml
Retrying (30)...
Error: UPGRADE FAILED: remote error: tls: bad certificate
This cluster doesn't contain anything other than the services, pods, and deployments created specifically for Auto DevOps. How should I go about 'resetting' the cluster or uninstalling these services?
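One common, unofficial way to reset GitLab's managed apps (a sketch, not an official GitLab procedure) is to remove the cluster integration from the affected project/group in the GitLab UI and then delete the namespace the managed apps were installed into, so they can be reinstalled from scratch. Note that this deletes everything in that namespace:
% kubectl delete namespace gitlab-managed-apps
The tls: bad certificate error above is consistent with the new integration talking to a Tiller that still holds certificates from the earlier, disconnected integration; deleting the namespace discards the old Tiller along with its certificates, so Helm Tiller, Prometheus, and the runner can be reinstalled cleanly from GitLab.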

Related

Error: forwarding ports: error upgrading connection: error dialing backend: - Azure Kubernetes Service

We have upgraded our Kubernetes Service cluster on Azure to the latest version, 1.12.4. After that we suddenly noticed that pods and nodes can no longer communicate with each other by private IP:
kubectl get pods -o wide -n kube-system -l component=kube-proxy
NAME READY STATUS RESTARTS AGE IP NODE
kube-proxy-bfhbw 1/1 Running 2 16h 10.0.4.4 aks-agentpool-16086733-1
kube-proxy-d7fj9 1/1 Running 2 16h 10.0.4.35 aks-agentpool-16086733-0
kube-proxy-j24th 1/1 Running 2 16h 10.0.4.97 aks-agentpool-16086733-3
kube-proxy-x7ffx 1/1 Running 2 16h 10.0.4.128 aks-agentpool-16086733-4
As you can see, the node aks-agentpool-16086733-0 has private IP 10.0.4.35. When we try to check the logs of pods on this node, we get this error:
Get https://aks-agentpool-16086733-0:10250/containerLogs/emw-sit/nginx-sit-deploy-864b7d7588-bw966/nginx-sit?tailLines=5000&timestamps=true: dial tcp 10.0.4.35:10250: i/o timeout
We have Tiller (Helm) on this node as well, and when we try to connect to Tiller we get this error from a client PC:
shmits-imac:~ andris.shmits01$ helm version
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Error: forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.0.4.35:10250: i/o timeout
Does anybody have any idea why the pods and nodes lost connectivity by private IP?
So, after we scaled the cluster down from 4 nodes to 2, the problem disappeared. And after we scaled back up from 2 nodes to 4, everything started working fine.
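A hedged sketch of that scale-down/scale-up with the Azure CLI; the resource group and cluster names are placeholders:
az aks scale --resource-group <resource-group> --name <cluster-name> --node-count 2
az aks scale --resource-group <resource-group> --name <cluster-name> --node-count 4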
The issue could be with the apiserver. Did you check the logs of the apiserver pod?
Can you run the command below inside the cluster? Do you get a 200 OK response?
curl -k -v https://10.96.0.1/version
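If you'd rather not exec into an existing pod, a hedged way to run that check from inside the cluster is a throwaway pod; the image and pod name here are just examples:
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl --command -- curl -k -v https://10.96.0.1/version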
These issues come up when nodes in a Kubernetes cluster created with kubeadm do not get internal IP addresses that match the nodes'/machines' actual IPs.
Issue: if I run the helm list command against my cluster, I get the error below:
helm list
Error: forwarding ports: error upgrading connection: unable to upgrade connection: pod does not exist
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k-master Ready master 3h10m v1.18.5 10.0.0.5 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
k-worker01 Ready <none> 179m v1.18.5 10.0.0.6 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
k-worker02 Ready <none> 167m v1.18.5 10.0.2.15 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
Please note: k-worker02 has internal IP 10.0.2.15, but I was expecting 10.0.0.7, which is the node/machine IP.
Solution:
Step 1: Connect to the host (here k-worker02) that does not have the expected IP
Step 2: Open the file below
sudo vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Step 3: Edit the ExecStart line, appending --node-ip 10.0.0.7
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --node-ip 10.0.0.7
Step 4: Reload the daemon and restart the kubelet service
sudo systemctl daemon-reload && sudo systemctl restart kubelet
Result:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k-master Ready master 3h36m v1.18.5 10.0.0.5 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
k-worker01 Ready <none> 3h25m v1.18.5 10.0.0.6 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
k-worker02 Ready <none> 3h13m v1.18.5 10.0.0.7 <none> Ubuntu 18.04.3 LTS 4.15.0-58-generic docker://19.3.12
With the above solution, the k-worker02 node now has the expected IP (10.0.0.7), and the "forwarding ports" error no longer comes up from the helm list or helm install commands.
Reference: https://networkinferno.net/trouble-with-the-kubernetes-node-ip
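If your kubeadm packaging sources /etc/default/kubelet (or /etc/sysconfig/kubelet on RPM-based distros) from the drop-in above, an equivalent way to pass the flag, assuming no other extra args are already set there, is:
sudo vi /etc/default/kubelet
# add or edit this line:
KUBELET_EXTRA_ARGS=--node-ip=10.0.0.7
sudo systemctl daemon-reload && sudo systemctl restart kubelet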

AKS reporting "Insufficient pods"

I've gone through the Azure Cats&Dogs tutorial described here, and I am getting an error in the final step where the apps are launched in AKS. Kubernetes is reporting that I have insufficient pods, but I'm not sure why this would be. I ran through this same tutorial a few weeks ago without problems.
$ kubectl apply -f azure-vote-all-in-one-redis.yaml
deployment.apps/azure-vote-back created
service/azure-vote-back created
deployment.apps/azure-vote-front created
service/azure-vote-front created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
azure-vote-back-655476c7f7-mntrt 0/1 Pending 0 6s
azure-vote-front-7c7d7f6778-mvflj 0/1 Pending 0 6s
$ kubectl get events
LAST SEEN TYPE REASON KIND MESSAGE
3m36s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
84s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
70s Warning FailedScheduling Pod skip schedule deleting pod: default/azure-vote-back-655476c7f7-l5j28
9s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
53m Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-kjld6
99s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-l5j28
24s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-back-655476c7f7-mntrt
53m Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
99s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
24s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-back-655476c7f7 to 1
9s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
3m36s Warning FailedScheduling Pod 0/1 nodes are available: 1 Insufficient pods.
53m Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-front-7c7d7f6778-rmbqb
24s Normal SuccessfulCreate ReplicaSet Created pod: azure-vote-front-7c7d7f6778-mvflj
53m Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-front-7c7d7f6778 to 1
53m Normal EnsuringLoadBalancer Service Ensuring load balancer
52m Normal EnsuredLoadBalancer Service Ensured load balancer
46s Normal DeletingLoadBalancer Service Deleting load balancer
24s Normal ScalingReplicaSet Deployment Scaled up replica set azure-vote-front-7c7d7f6778 to 1
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-nodepool1-27217108-0 Ready agent 7d4h v1.9.9
The only thing I can think of that has changed is that I now have other (larger) clusters running as well, and the main reason I went through this Cats&Dogs tutorial again is that I hit this same problem today with my other clusters. Is this a resource limit issue with my Azure account?
Update 10-20/3:15 PST: Notice how these three clusters all show that they use the same node pool, even though they were created in different resource groups. Also note how the get-credentials call for gem2-cluster reports an error. I did have an earlier cluster called gem2-cluster, which I deleted and recreated using the same name (in fact I deleted the whole resource group). What's the correct process for doing this?
$ az aks get-credentials --name gem1-cluster --resource-group gem1-rg
Merged "gem1-cluster" as current context in /home/psteele/.kube/config
$ kubectl get nodes -n gem1
NAME STATUS ROLES AGE VERSION
aks-nodepool1-27217108-0 Ready agent 3h26m v1.9.11
$ az aks get-credentials --name gem2-cluster --resource-group gem2-rg
A different object named gem2-cluster already exists in clusters
$ az aks get-credentials --name gem3-cluster --resource-group gem3-rg
Merged "gem3-cluster" as current context in /home/psteele/.kube/config
$ kubectl get nodes -n gem1
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
$ kubectl get nodes -n gem2
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
$ kubectl get nodes -n gem3
NAME STATUS ROLES AGE VERSION
aks-nodepool1-14202150-0 Ready agent 26m v1.9.11
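On the gem2-cluster credentials error specifically ("A different object named gem2-cluster already exists in clusters"): after deleting and recreating a cluster under the same name, the old entry is usually still sitting in ~/.kube/config, and the Azure CLI can simply overwrite it:
$ az aks get-credentials --name gem2-cluster --resource-group gem2-rg --overwrite-existing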
What is your max-pods set to? This is a normal error when you've reached the limit of pods per node.
You can check your current maximum number of pods per node with:
$ kubectl get nodes -o yaml | grep pods
pods: "30"
pods: "30"
And your current count with:
$ kubectl get pods --all-namespaces | grep Running | wc -l
18
I hit this because I exceeded the max pods; I found out how many I could handle by running:
$ kubectl get nodes -o json | jq -r .items[].status.allocatable.pods | paste -sd+ - | bc
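If the per-node limit itself is too low, note that on AKS it can only be set when the cluster (or node pool) is created; a hedged example with placeholder names:
az aks create --resource-group <resource-group> --name <cluster-name> --node-count 3 --max-pods 110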
Check to make sure you are not hitting core limits for your subscription.
az vm list-usage --location "<location>" -o table
If you are, you can request more quota: https://learn.microsoft.com/en-us/azure/azure-supportability/resource-manager-core-quotas-request

New AKS cluster unreachable via network (including dashboard)

Yesterday I spun up an Azure Kubernetes Service cluster running a few simple apps. Three of them have exposed public IPs that were reachable yesterday.
As of this morning I can't get the dashboard tunnel to work, nor can I reach the LoadBalancer IPs themselves.
I was asked by the Azure twitter account to solicit help here.
I don't know how to troubleshoot this apparent network issue - only az seems to be able to touch my cluster.
dashboard error log
❯❯❯ make dashboard ~/c/azure-k8s (master)
az aks browse --resource-group=akc-rg-cf --name=akc-237
Merged "akc-237" as current context in /var/folders/9r/wx8xx8ls43l8w8b14f6fns8w0000gn/T/tmppst_atlw
Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out
service+pod listing
❯❯❯ kubectl get services,pods ~/c/azure-k8s (master)
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
azure-vote-back ClusterIP 10.0.125.49 <none> 6379/TCP 16h
azure-vote-front LoadBalancer 10.0.185.4 40.71.248.106 80:31211/TCP 16h
hubot LoadBalancer 10.0.20.218 40.121.215.233 80:31445/TCP 26m
kubernetes ClusterIP 10.0.0.1 <none> 443/TCP 19h
mti411-web LoadBalancer 10.0.162.209 52.168.123.30 80:30874/TCP 26m
NAME READY STATUS RESTARTS AGE
azure-vote-back-7556ff9578-sjjn5 1/1 Running 0 2h
azure-vote-front-5b8878fdcd-9lpzx 1/1 Running 0 16h
hubot-74f659b6b8-wctdz 1/1 Running 0 9s
mti411-web-6cc87d46c-g255d 1/1 Running 0 26m
mti411-web-6cc87d46c-lhjzp 1/1 Running 0 26m
http failures
❯❯❯ curl --connect-timeout 2 -I http://40.121.215.233 ~/c/azure-k8s (master)
curl: (28) Connection timed out after 2005 milliseconds
❯❯❯ curl --connect-timeout 2 -I http://52.168.123.30 ~/c/azure-k8s (master)
curl: (28) Connection timed out after 2001 milliseconds
If you are getting getsockopt: connection timed out while trying to access your AKS dashboard, I think deleting the tunnelfront pod will help: once you delete it, the master will trigger the creation of a new tunnelfront pod. It's something I have tried, and it worked for me.
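A short sketch of that suggestion, assuming the tunnelfront pod lives in kube-system as on a typical AKS cluster (the grep just finds its generated name):
kubectl get pods -n kube-system | grep tunnelfront
kubectl delete pod <tunnelfront-pod-name> -n kube-system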
@daniel Did rebooting the agent VMs solve your issue, or are you still seeing issues?

Azure AKS: no nodes found

I created an Azure AKS cluster with 3 nodes (Standard DS3 v2: 4 vCPUs, 14 GB memory). I was fiddling with the cluster and created a Deployment with 1000 replicas. After this, the complete cluster went down.
azureuser#saa:~$ k get cs
NAME STATUS MESSAGE ERROR
controller-manager Unhealthy Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused
scheduler Unhealthy Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused
etcd-0 Healthy {"health": "true"}
From debugging, it seems both the scheduler and the controller-manager went down. How do I fix this?
What exactly happened when I created a Deployment with 1000 replicas? Shouldn't k8s take care of that?
Output from a few debugging commands:
kubectl cluster-info
Kubernetes master is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443
Heapster is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/heapster/proxy
KubeDNS is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
kubernetes-dashboard is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy
Logs for kubectl cluster-info dump # http://termbin.com/e6wb
azureuser#sim:~$ az aks scale -n cg -g cognitive-games -c 4 --verbose
Deployment failed. Correlation ID: 4df797b2-28bf-4c18-a26a-4e341xxxxx. Operation failed with status: 200. Details: Resource state Failed
No nodes are displayed:
azureuser#si:~$ k get nodes
No resources found
Looks silly, but when an AKS cluster is created in a resource group, two resource groups are surprisingly created: one with the AKS resource, and another one, with a random hash in its name, containing all the VMs. I had deleted that second RG, and the basic AKS cluster stopped working.
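For reference, that second, auto-generated resource group can be looked up from the cluster itself rather than guessed; a sketch with placeholder names:
az aks show --resource-group <resource-group> --name <cluster-name> --query nodeResourceGroup -o tsv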

Kubernetes on Raspberry Pi: kube-flannel CrashLoopBackOff and kube-dns rpc error code = 2

I used this tutorial to set up a Kubernetes cluster on my Raspberry Pi 3.
I followed the instructions up to the setup of flannel, done with:
curl -sSL https://rawgit.com/coreos/flannel/v0.7.0/Documentation/kube-flannel.yml | sed "s/amd64/arm/g" | kubectl create -f -
I get the following output from kubectl get po --all-namespaces:
kube-system   etcd-node01                      1/1   Running            0    34m
kube-system   kube-apiserver-node01            1/1   Running            0    34m
kube-system   kube-controller-manager-node01   1/1   Running            0    34m
kube-system   kube-dns-279829092-x4dc4         0/3   rpc error: code = 2 desc = failed to start container "de9b2094dbada10a0b44df97d25bb629d6fbc96b8ddc0c060bed1d691a308b37": Error response from daemon: {"message":"cannot join network of a non running container: af8e15c6ad67a231b3637c66fab5d835a150da7385fc403efc0a32b8fb7aa165"}   15   39m
kube-system   kube-flannel-ds-zk17g            1/2   CrashLoopBackOff   11   35m
kube-system   kube-proxy-6zwtb                 1/1   Running            0    37m
kube-system   kube-proxy-wbmz2                 1/1   Running            0    39m
kube-system   kube-scheduler-node01            1/1   Running
Interestingly, I had the same issue installing Kubernetes with flannel on my laptop following another tutorial.
Version details are here:
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.3", GitCommit:"0480917b552be33e2dba47386e51decb1a211df6", GitTreeState:"clean", BuildDate:"2017-05-10T15:48:59Z", GoVersion:"go1.8rc2", Compiler:"gc", Platform:"linux/arm"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.3", GitCommit:"0480917b552be33e2dba47386e51decb1a211df6", GitTreeState:"clean", BuildDate:"2017-05-10T15:38:08Z", GoVersion:"go1.8rc2", Compiler:"gc", Platform:"linux/arm"}
Any suggestions that might help?
I solved this issue by generating cluster-roles before setting up the pod network driver:
curl -sSL https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel-rbac.yml | sed "s/amd64/arm/g" | kubectl create -f -
Then I set up the pod network driver with:
curl -sSL https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml | sed "s/amd64/arm/g" | kubectl create -f -
Worked for me so far...
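To confirm the fix took, a quick check (plain kubectl, nothing tutorial-specific) that the flannel and kube-dns pods settle into Running:
kubectl get pods --all-namespaces -o wide | grep -E 'flannel|kube-dns'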
