Unable to connect to AKS cluster: connection times out - Azure

I've created an AKS cluster in the UK region in Azure.
Currently, I can no longer access my AKS cluster. Connecting to the public IPs fails; all connections time out.
Furthermore, I can't run the kubectl command either:
fcarlier@ubuntu:~$ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout
Is there a known issue with AKS in that region or is it something on my side?

Is there a known issue with AKS in that region or is it something on my side?
Sorry for the bad experience. Azure AKS is still in preview for now; please try to recreate the cluster, as ukwest is working fine at the moment. Here is a similar case; please refer to it.

I just successfully created a single-node AKS cluster on UK West with no issues. Can you please retest? For now, I would avoid provisioning on West US 2 until the threshold issues are fixed. I'm aware the AKS team is actively engaged in restoring service on West US. Sorry for the inconvenience. Below are the sample commands to create a cluster in the UK if you need a reference. Hope this helps.
Create the resource group (UK West):
az group create --name myResourceGroupUK --location ukwest
Create the AKS cluster (UK West):
az aks create --resource-group myResourceGroupUK --name myK8sClusterUK --agent-count 1 --generate-ssh-keys
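Once the cluster shows as succeeded, a quick sketch to fetch credentials and verify connectivity (same placeholder names as above):
az aks get-credentials --resource-group myResourceGroupUK --name myK8sClusterUK
kubectl get nodes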

I just finished a big post over here on this topic (which is not as straightforward as a single solution/workaround): 'Unable to connect Net/http: TLS handshake timeout' - Why can't Kubectl connect to Azure Kubernetes server? (AKS)
That being said, the solution in this case, for me, was to scale the nodes up, and then back down, for my impacted cluster from the Azure Kubernetes Service blade in the web console.
Workaround / Potential Solution
Log into the Azure Console — Kubernetes Service blade.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Total time it took me ~2 mins.
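If you prefer the CLI over the portal blade, a rough equivalent of the same scale-up-then-down (cluster name, resource group, and node counts below are placeholders for your own values):
# Scale up by one node (assuming the cluster currently runs 3)
az aks scale --resource-group myResourceGroup --name myK8sCluster --node-count 4
# Check that the API server responds again
kubectl get nodes
# Scale back down to the original size to avoid extra cost
az aks scale --resource-group myResourceGroup --name myK8sCluster --node-count 3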
More Background Info on the Issue
I also added this solution to the full ticket write-up that I posted over here (if you want more info, have a read): 'Unable to connect Net/http: TLS handshake timeout' - Why can't Kubectl connect to Azure Kubernetes server? (AKS)

Related

Azure DevOps Release Pipeline || To sign in, use a web browser to open

I created the AKS cluster with an Azure service principal ID and gave it the Contributor role on the subscription and resource group.
Every time I execute the pipeline it asks me to sign in, and only after I authenticate does it get the data.
Also, the "kubectl get" task takes more than 30 minutes and ends with "Kubectl Server Version: Could not find kubectl server version".
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code CRA2XssWEXUUA to authenticate
Thanks in advance
What is the version of the created cluster?
I'm assuming from your snapshot that you are using az in order to get credentials for it.
The old Azure auth plugin is deprecated in v1.22+. If you are using v1.22 or above, you should use kubelogin in order to authenticate.
You will also need to update your kube config accordingly:
kubelogin convert-kubeconfig
and specifically, if you're logging in via az:
kubelogin convert-kubeconfig -l azurecli
Note that the -l azurecli flag is important here: the default value is "devicecode", which will not treat your az login as the authentication method, and you will still be asked to authenticate in a browser.
Alternatively, you can set environment variable:
AAD_LOGIN_METHOD=azurecli
Because you are getting a sign-in request and not the deprecation warning for the auth plugin, I suspect that you already have kubelogin installed on your agent and just need to update the kube config file.
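Putting the pieces together, the commands the agent needs to run before kubectl would look roughly like this sketch (resource group and cluster names are placeholders):
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster --overwrite-existing
kubelogin convert-kubeconfig -l azurecli
kubectl get nodes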
What task are you using? There is an official kubectl task: https://learn.microsoft.com/en-us/azure/devops/pipelines/tasks/deploy/kubernetes?view=azure-devops
It requires a service connection.
If you still want to execute kubectl directly, you should run the following before kubectl inside the AzureCLI task:
az aks get-credentials --resource-group "$(resourceGroup)" --name "$(k8sName)" --overwrite-existing
Please use self-hosted agents for executing your commands. It looks like you have private endpoints for your AKS and requests are only allowed from trusted devices.
I ran into the same issue, and for me the fix was to change the Connection Type in the stage definition from Azure Resource Manager to Kubernetes Service Connection.
Then you should also be able to specify the connection type in each of the tasks where you are running kubectl or helm commands. For example, in a kubectl task, under Kubernetes Cluster --> Service connection type, use the Kubernetes Service Connection.
As mentioned by #DevOpsEngg, the problem could be related to private endpoints, but I wouldn't say it is about self-hosted agents, because I'm using those. As an extra comment, this started happening when I added more than one user to the cluster, so you might want to check user permissions and authentication. Unfortunately, I'm still getting used to K8s, so I don't have more info about that.

Pull images from an Azure container registry to a Kubernetes cluster

I have followed this tutorial microsoft_website to pull images from an Azure container registry. My yaml successfully creates a pod job, which can pull the image, BUT only when it runs on the agentpool node in my cluster.
For example, adding nodeName: aks-agentpool-33515997-vmss000000 to the yaml works fine, but with a different node name, e.g. nodeName: aks-cpu1-33515997-vmss000000, the pod fails. The error message I get with describe pods is Failed to pull image and then kubelet Error: ErrImagePull.
What am I missing?
Create secret:
kubectl create secret docker-registry <secret-name> \
--docker-server=<container-registry-name>.azurecr.io \
--docker-username=<service-principal-ID> \
--docker-password=<service-principal-password>
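For a pod to actually use this secret, it has to be referenced via imagePullSecrets, either in the pod spec or on the service account the pod runs under. As a sketch, attaching it to the default service account (using the secret name created above):
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'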
As #user1571823 said, the solution to the problem is deleting the old image from the ACR and creating/pushing a new one.
The problem was related to some sort of corruption in the image saved in the Azure Container Registry (ACR). The reason one agent pool could pull the image was actually that the image already existed on the VM.
Furthermore, as #andov said, it is a good option to open an incident case with Azure support for AKS from the subscription where AKS is deployed. The support team has full access to the AKS service backend and can tell you exactly what was causing your problem.
Four things to check:
Is it a subscription issue? Are the nodes in different subscriptions?
Is it a rights issue? Does the service principal of the node have rights to pull the image? (One way to check is sketched after this list.)
Is it a network issue? Are the nodes on different subnets?
Is there something about the image size or configuration that means it cannot run on the other cluster?
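For the rights question, a rough way to check from the CLI; the names are placeholders, and this assumes the cluster uses a managed kubelet identity rather than a service principal:
# Resolve the identity the kubelet uses to pull images
az aks show --resource-group myResourceGroup --name myAKSCluster --query identityProfile.kubeletidentity.objectId -o tsv
# List its role assignments on the registry and look for AcrPull
az role assignment list --assignee <objectId-from-previous-command> --scope $(az acr show --name myacr --query id -o tsv) -o table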
Edit
New-AzAksNodePool has a parameter -DefaultProfile.
It can be AzContext, AzureRmContext, or AzureCredential.
If this is different between your nodes, it would explain the error.

How to get old rotated logs in Azure AKS Kubernetes (for Hyperledger Fabric peer nodes)

I am running a Hyperledger Fabric network in Azure. I see that if I try to get the logs of the Fabric peer nodes using the "kubectl logs ..." command, I only get approximately the last 24 hours. AKS is probably rotating them. How can I get the previous days' logs of these pods?
It depends on the policy and configuration of your AKS service. If you want to get older logs, you can use the --since-time option to select them, as you can see in this example:
kubectl logs nginx-78f5d695bd-czm8z --since-time=2018-11-01T15:00:00Z
On the other hand, you can also configure the Azure Log Analytics service and define a retention policy to keep your logs stored for as long as you want.
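If you go the Log Analytics route, the Container Insights add-on can be enabled on the cluster with something like the sketch below (the cluster, resource group, and workspace resource ID are placeholders):
az aks enable-addons --resource-group myResourceGroup --name myAKSCluster --addons monitoring --workspace-resource-id <log-analytics-workspace-resource-id>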

Unable to pull image from Azure Container Registry

We recently had an issue with our Azure Kubernetes Cluster not reporting back any data through the Azure Portal. To fix this, I updated the Kubernetes version to the latest version as was recommended on GitHub. After the upgrade was complete, we were able to view logs and monitoring data through the portal, but one of the containers stored in our Azure Container Registry is not able to be pulled by the Kubernetes Cluster.
The error I see in the Kubernetes management page is:
Failed to pull image "myacr.azurecr.io/container:190305.191": [rpc error: code = Unknown desc = Error response from daemon: Get https://myacr.azurecr.io/v2/mycontainer/manifests/190305.191: unauthorized: authentication required, rpc error: code = Unknown desc = Error response from daemon: Get https://myacr.azurecr.io/v2/mycontainer/manifests/190305.191: unauthorized: authentication required]
My original setup used the first script provided in this document and it worked correctly without issue. Once I started receiving the error, I ran it again just to make sure.
Once I saw that fail, I deleted the account from the permissions on both the ACR and the AKS. Again, it failed to pull the image.
After that, I tried using the second method of creating a Kubernetes secret and received the same error.
At this point, I'm unsure what else to check. I've verified that I can run docker pull on my machine and pull the image, but there seems to be a breakdown between the AKS and the ACR that I can not sort out.
It's been a while since I originally posted this, but I did stumble across a currently stable solution to the problem.
The service principal, for whatever reason, is not able to maintain a connection to the ACR. So if your cluster ever goes down, you lose the ability to pull from the ACR. I had this happen multiple times over the last year and as I moved more of my Kubernetes deployment to Azure, it became a bigger and bigger issue.
I stumbled across this Microsoft Doc and noticed the mention of the --attach-acr command.
This is what the full command looks like:
az aks create -n myAKSCluster -g myResourceGroup --generate-ssh-keys --attach-acr $MYACR
Since setting it up with that flag, I have had 0 issues with it.
knock on wood
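If the cluster already exists, the same attachment can be applied after the fact; a sketch using the same placeholder names as above:
az aks update --name myAKSCluster --resource-group myResourceGroup --attach-acr $MYACR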

Service Fabric cluster provisioning fails after adding a secondary certificate through Resource Manager

I was trying to swap the certificate of my Service Fabric cluster because the previous certificate was about to expire. Searching the web, I found a way to add a secondary certificate to the cluster through Azure Resource Manager.
So I added the certificate to my Key Vault, and after that I added the certificate thumbprint to the cluster as a secondary certificate using Resource Manager; up to here everything was OK.
The problem happened when I tried to swap the two certificates through the Azure portal: my cluster entered a failed provisioning state, and after that I can't make any changes to my cluster; it keeps giving me the same error, that the cluster has a pending change.
Below is the description of the error:
statusCode:BadRequest
serviceRequestId:dcb6f784-018e-4789-ac4d-4426bd68b66c
statusMessage:{"error":{"code":"PendingClusterUpgradeCannotBeInterrupted","message":"The cluster is going through a an upgrade which cannot be interrupted.","details":[]}}
responseBody:{"error":{"code":"PendingClusterUpgradeCannotBeInterrupted","message":"The cluster is going through a an upgrade which cannot be interrupted.","details":[]}}
Has someone had this problem before?

Resources