Service Fabric cluster provisioning fails after adding a secondary certificate through Resource Manager - azure

I was trying to swap the certificate of my Service Fabric cluster because the previous certificate was about to expire. Searching the web, I found a way to add a secondary certificate to the cluster through Azure Resource Manager.
So I added the certificate to my key vault and then added its thumbprint to the cluster as a secondary certificate using Resource Manager. Up to this point everything was fine.
The problem appeared when I tried to swap the two certificates through the Azure portal. My cluster entered a "cluster provisioning failed" state, and since then I can't make any changes to it. Every attempt returns the same error saying that the cluster has a pending change.
Below is the description of the error:
statusCode:BadRequest
serviceRequestId:dcb6f784-018e-4789-ac4d-4426bd68b66c
statusMessage:{"error":{"code":"PendingClusterUpgradeCannotBeInterrupted","message":"The cluster is going through a an upgrade which cannot be interrupted.","details":[]}}
responseBody:{"error":{"code":"PendingClusterUpgradeCannotBeInterrupted","message":"The cluster is going through a an upgrade which cannot be interrupted.","details":[]}}
Has anyone had this problem before?
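When scripting against the cluster, it can help to detect this specific error code before retrying a change. The following is a minimal sketch, assuming the statusMessage text is available as a string exactly as shown above; the function names are hypothetical and not part of any Azure SDK.

```python
import json

# statusMessage text copied from the error output above.
STATUS_MESSAGE = (
    '{"error":{"code":"PendingClusterUpgradeCannotBeInterrupted",'
    '"message":"The cluster is going through a an upgrade which cannot be '
    'interrupted.","details":[]}}'
)

PENDING_UPGRADE_CODE = "PendingClusterUpgradeCannotBeInterrupted"


def extract_error_code(status_message: str) -> str:
    """Return the error code from a statusMessage JSON string."""
    payload = json.loads(status_message)
    return payload["error"]["code"]


def is_pending_upgrade(status_message: str) -> bool:
    """True when the request was rejected because an upgrade is still pending."""
    return extract_error_code(status_message) == PENDING_UPGRADE_CODE


if __name__ == "__main__":
    print(extract_error_code(STATUS_MESSAGE))
    print(is_pending_upgrade(STATUS_MESSAGE))
```

A script could use `is_pending_upgrade` to decide whether to wait and try again later instead of treating the rejection as a permanent failure.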

Related

AKS Cluster deployment fails with "ReconcileMSICredentialError"

When I try to deploy a fresh AKS cluster with "Dev/Test" settings via the Portal, I get the following error during deployment:
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed.
Please list deployment operations for details. Please see
https://aka.ms/DeployOperations for usage details.","details":
[{"code":"ReconcileMSICredentialError","message":"Reconcile MSI credential failed.
Details: autorest/azure: Service returned an error. Status=409 Code=\"Conflict\"
Message=\"Secret bf905bf9e9ad86526b26e98d2ea490a0a500ff23907f9df987d95de3a649e751 is
currently being deleted and cannot be re-created; retry later.\" InnerError=
{\"code\":\"ObjectIsBeingDeleted\"}."}]}
However, the resource still gets deployed, but with a notification that "the resource is in a failed state". When I stop the cluster and start it again, the notification disappears, but I'm not sure whether the error remains.
I can avoid the error altogether if I pick a new name for the cluster. However, I'd like to keep the old name.
The same happens when I deploy with different settings (CPU, number of nodes, etc.). I also tried deleting the cluster entirely and deploying it from scratch, but the error persists. I haven't found any explanation for this error on either Stack Overflow or Google.
What could be the reason for this error, and how can I avoid it?
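The outer "DeploymentFailed" code hides the more specific code nested under "details". As a rough sketch of how to surface it, the snippet below walks an error object shaped like the one above. The dictionary is an abbreviated, hand-written copy of that error (the long message is shortened with "..."), and the function name is hypothetical.

```python
# Abbreviated copy of the deployment error shown above; the inner
# message is shortened with "..." and is not the full original text.
DEPLOYMENT_ERROR = {
    "code": "DeploymentFailed",
    "message": "At least one resource deployment operation failed.",
    "details": [
        {
            "code": "ReconcileMSICredentialError",
            "message": (
                "Reconcile MSI credential failed. Details: ... "
                'Status=409 Code="Conflict" ... '
                'InnerError={"code":"ObjectIsBeingDeleted"}.'
            ),
        }
    ],
}


def collect_error_codes(error: dict) -> list:
    """Return the error code of this entry followed by all nested detail codes."""
    codes = [error["code"]]
    for detail in error.get("details", []):
        codes.extend(collect_error_codes(detail))
    return codes


if __name__ == "__main__":
    print(collect_error_codes(DEPLOYMENT_ERROR))
```

Here the nested code and the "ObjectIsBeingDeleted" inner error suggest that a secret tied to the old cluster name was still being deleted when the new deployment ran, which matches the observation that a new cluster name avoids the error.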
I tried to reproduce the issue in my environment and got the results below.
I created an AKS cluster with the Dev/Test preset.
The reference cluster was created successfully.
I fetched the cluster credentials using the command below:
az aks get-credentials --resource-group Alldemorg --name cluster_name
I then created a sample application and deployed it to the cluster, using a reference sample file.
The deployment succeeded, and I can see all the pods and nodes that were created for the application.
Note:
1) The "ReconcileMSICredentialError" error can be caused by the version in use; check the version and upgrade to the latest.
2) If the resource is in a failed state, delete the entire resource (rather than only the cluster) and create it again. Stopping and starting the resource may lead to "ReconcileMSICredentialError" again.

After removing nodes from Azure service fabric cluster, cluster status is ‘Upgrade service unreachable’

A 5-node Azure VMSS (associated with an Azure Service Fabric cluster) was scaled in to 1 node. After the scale-in, the VMSS is up, running, and healthy.
But the Azure Service Fabric cluster that uses this VMSS entered the 'Upgrade service unreachable' status and shows 0 nodes and 0 applications. I also can't access the Service Fabric Explorer web page.
Even after scaling the VMSS back out to 5 nodes, the Service Fabric status did not change and the cluster remains inoperable. How can I bring the cluster back to an operable state without rebuilding it from scratch? What is the root cause of the inoperable state: is it because the VMSS was scaled down to 1 node, or because some of its seed nodes were removed?
I tried deallocating and restarting the VMSS node, and I changed the VM size in the VMSS from 'Standard_DS1_v1' to 'Standard_DS1_v2', but no luck.
Service Fabric Version: 7.1.458.9590
Exception:
Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check Error if there is connectivity/firewall/DNS issue.

Azure webapp not updating certificate on keyvault

I have a webapp running on Azure and it gets its SSL certificate from Keyvault.
I updated the certificate in Key Vault a week ago, and the web app is still using the old one.
According to the Azure documentation, the web app checks for new certificates regularly.
Here is what I see on Azure KeyVault -> Certificates:
Here is the certificate on my webapp:
The certificate was attached with Azure ARM template:
{
  "type": "Microsoft.Web/certificates",
  "name": "[parameters('environmentConfiguration').Certificate]",
  "apiVersion": "2016-03-01",
  "location": "[resourceGroup().location]",
  "properties": {
    "keyVaultId": "[variables('keyVaultId')]",
    "keyVaultSecretName": "[parameters('environmentConfiguration').Certificate]",
    "serverFarmId": "[resourceId(variables('serverFarmResourceGroup'), 'Microsoft.Web/serverfarms', variables('serverFarmName'))]"
  },
How can I troubleshoot this kind of problem?
Your web app is still using the old certificate a week after you updated it. A possible cause is described below:
The Web Apps feature of Azure App Service runs a background job
every eight hours and syncs the certificate resource if there are any
changes. When you rotate or update a certificate, sometimes the
application is still retrieving the old certificate and not the newly
updated certificate. The reason is that the job to sync the
certificate resource hasn't run yet.
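As a rough check on the eight-hour interval quoted above, the sketch below computes the latest time by which a background sync should have run after an update. The timestamps are made-up examples and the helper names are hypothetical; the only value taken from the text is the eight-hour interval.

```python
from datetime import datetime, timedelta

# Sync interval quoted in the passage above.
SYNC_INTERVAL = timedelta(hours=8)


def latest_expected_sync(updated_at: datetime) -> datetime:
    """Latest time by which one sync cycle should have run after the update."""
    return updated_at + SYNC_INTERVAL


def sync_is_overdue(updated_at: datetime, now: datetime) -> bool:
    """True when more than one full sync interval has passed since the update."""
    return now > latest_expected_sync(updated_at)


if __name__ == "__main__":
    # Hypothetical example: certificate updated a week before "now".
    updated = datetime(2021, 1, 1, 12, 0)
    now = updated + timedelta(days=7)
    print(latest_expected_sync(updated))
    print(sync_is_overdue(updated, now))
```

With a week elapsed, many sync cycles should already have run, so a pending background job alone is unlikely to explain the stale certificate, and a forced sync or a configuration check is the next step.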
Solution:
You can force a sync of the certificate. Select the certificate from App Service Certificates, select Rekey and Sync, and then select Sync. The sync takes some time to finish. When the sync is completed, you see the following notification: "Successfully updated all the resources with the latest certificate."
Update
Please verify that the configuration of the new certificate is correct by referring to this.
Please check the Prerequisites, Deploying Key Vault Certificate into Web App, and Rotating Certificate sections of this blog: deploying Azure Web App Certificate through Key Vault.

Unable to connect AKS cluster: connection time out

I've created an AKS cluster in the UK region in Azure.
Currently, I can no longer access my AKS cluster. Connecting to the public IPs fails; all connections time out.
Furthermore, I can't run the kubectl command either:
fcarlier@ubuntu:~$ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout
Is there a known issue with AKS in that region or is it something on my side?
Sorry for the bad experience.
AKS is still in preview for now. Please try to recreate the cluster; ukwest works fine now.
Here is a similar case; please refer to it.
I just successfully created a single node AKS cluster on UK West with no issues. Can you please retest? For now, I would avoid provisioning on West US 2 until the threshold issues are fixed. I'm aware the AKS team is actively engaged to restore service on West US. Sorry for the inconvenience. Below is the sample cmd to create in UK if you need the reference. Hope this helps.
Create Resource Group (UK West): az group create --name myResourceGroupUK --location ukwest
Create AKS cluster (UK West): az aks create --resource-group myResourceGroupUK --name myK8sClusterUK --agent-count 1 --generate-ssh-keys
I just finished a big post over here on this topic (which is not as straight forward as a single solution / workaround): 'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)
That being said, the solution to this one for me was to scale the nodes up — and then back down — for my impacted Cluster from the Azure Kubernetes service blade web console.
Workaround / Potential Solution
Log into the Azure Console — Kubernetes Service blade.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Total time it took me ~2 mins.
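The steps above use the portal. As an assumption on my part, the same scale-up and scale-down could be done from the command line with `az aks scale`; the sketch below only builds the two command lines, and the resource group, cluster name, and node count are placeholder values.

```python
def scale_command(resource_group: str, cluster_name: str, node_count: int) -> list:
    """Build an `az aks scale` command line for the given node count."""
    return [
        "az", "aks", "scale",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--node-count", str(node_count),
    ]


def scale_up_then_down(resource_group: str, cluster_name: str, normal_count: int) -> list:
    """Return the two commands: scale up by one node, then back to normal size."""
    return [
        scale_command(resource_group, cluster_name, normal_count + 1),
        scale_command(resource_group, cluster_name, normal_count),
    ]


if __name__ == "__main__":
    # Placeholder names; replace with your own resource group and cluster.
    for cmd in scale_up_then_down("myResourceGroupUK", "myK8sClusterUK", 1):
        print(" ".join(cmd))
```

Each printed line can be run in a shell where the Azure CLI is logged in; wait for the first scale operation to finish and test `kubectl` before running the second.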
More Background Info on the Issue
Also added this solution to the full ticket description write up that I posted over here (if you want more info have a read):
'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)

how to upload a certificate in VM of azure cluster

This line is causing a problem because it requires the certificate to be present on the machine on which it is executing:
topologyConfigurationManager = new TopologyConfigurationManager(new Uri("https://int2.metrics.nsatc.net"), GenevaCertThumbprint, StoreLocation.LocalMachine, TimeSpan.FromMinutes(1));
I have gone through this link, deploying-application-certificates-to-the-cluster,
but I still don't understand how to upload a certificate to the VMs (nodes) of an Azure cluster. Can someone give me detailed steps on where to upload the certificate (.pfx file)?
I had this same problem a few days ago. I needed to change to a new certificate because the old one had expired, and I solved it by deploying the Azure resource template for Service Fabric again, which means that I basically recreated the whole environment.
In the template I changed only the certificate link and the thumbprint.
Finally got the answer:
Log in to a node of the remote cluster using the following command in cmd: mstsc /v:mycluster.eastus.cloudapp.azure.com:3389
Here "mycluster.eastus.cloudapp.azure.com" is the cluster name. After logging in, install the certificates manually.
Port 3389 is the first node, 3390 is the second node, and so on.
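The port numbering described above can be sketched as a small helper that produces the `mstsc` command for each node. This assumes the ports keep increasing by one per node starting at 3389, as the answer states; the function names are hypothetical.

```python
BASE_RDP_PORT = 3389  # port of the first node, per the answer above


def rdp_port(node_index: int) -> int:
    """Return the RDP port for a node, where node_index 0 is the first node."""
    if node_index < 0:
        raise ValueError("node_index must be 0 or greater")
    return BASE_RDP_PORT + node_index


def mstsc_command(cluster_host: str, node_index: int) -> str:
    """Build the mstsc command line for connecting to the given node."""
    return f"mstsc /v:{cluster_host}:{rdp_port(node_index)}"


if __name__ == "__main__":
    host = "mycluster.eastus.cloudapp.azure.com"
    for index in range(3):
        print(mstsc_command(host, index))
```

Because the certificate has to be installed manually on every node, looping over all node indexes gives the full list of connections to make.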
