Upgrade Service Fabric Service that Fails to Honor Cancellation Token - azure

I've got a stateful service running in a Service Fabric cluster that I now know fails to honor a cancellation token passed into it. My fault.
I'm ready to release the fix, but during the upgrade process, I'm expecting the service replica on the faulty primary node to get stuck since it won't honor the token passed in.
I can use Restart-ServiceFabricDeployedCodePackage or even Restart-ServiceFabricNode to manually take down the stuck replica, but that will result in a brief service interruption during the upgrade process.
Is there any way to release this fix with zero downtime?

This is not possible for a stateful service using the Service Fabric infrastructure, you will need to have downtime on the upgrade. Once you have a version that supports the cancellation token then you will be fine.
That said, depending on the use of the state, and if you have a load balancer between your clients and the service, you can stand up another service instance on the new fixed version and use the load balancer to drain your traffic across to then new version, upgrade the old, drain back to it and then drop the second service you created. This will allow for a zero downtime scenario.

The only workarounds I can think of are worse since they turn off parts of health checks during upgrades and "force" the process to come down. This doesn't make things more graceful or improve downtime, and has a side effect of potentially causing other health issues to be ignored.
There's always some downtime, even with the fully rolling upgrades, since swapping a primary to another node is never instantaneous and callers need to discover the new location. With those commands, you're just converting a more graceful shutdown and cleanup into a failure, which results in the same primary swap. Shouldn't be a huge difference since clients (and SF) have to deal with failure normally anyway.
I'd keep using those commands since they give you good manual control over which replicas/processes to poke when things get stuck.

Related

Aks Error Failed to drain the node, aborting scale down

I am using Kured to perform safe reboots of our nodes to upgrade the OS and kernel versions.
In my understanding, it works by cordoning and draining the node, and the pods are scheduled on a new node with the older version. After the reboot, the nodes are uncordoned and back to the ready state and the temporary worker nodes get deleted.
It was perfectly fine until yesterday when one of the nodes failed to upgrade to the latest kernel version. It was on 5.4.0-1058-azure last week after a successful upgrade and it should be on 5.4.0-1059-azure yesterday after the latest patch, but it is using the old version 5.4.0-1047-azure (which I think is the version of the temporary node that got created).
Upon checking the log analytics on azure, it says that it failed to scale down.
Reason: ScaleDownFailed
Message: failed to drain the node, aborting ScaleDown
Error message
Any idea on why this is happening?
Firstly, there is a little misunderstanding of the OS and Kernel patching process.
In my understanding, it works by cordoning and draining the node, and the pods are scheduled on a new node with the older version.
The new node that is/are added should come with the latest node image version with latest security patches (which usually does not fall back to an older kernel version) available for the node pool. You can check out the AKS node image releases here. Reference
However, it is not necessary that the pod(s) evicted by the drain operation from the node that is being rebooted at any point during the process has to land on the surge node. Evicted pod(S) might very well be scheduled on an existing node should the node fit the bill for scheduling these pods.
For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on. The scheduler reaches this placement decision taking into account the scheduling principles described here.
The documentation, at the time of writing, might be a little misleading on this.
About the error:
Reason: ScaleDownFailed
Message: failed to drain the node, aborting ScaleDown
This might happen due to a number of reasons. Common ones might be:
The scheduler could not find a suitable node to place evicted pods and the node pool could not scale up due to insufficient compute quota available. [Reference]
The scheduler could not find a suitable node to place evicted pods and the cluster could not scale up due to insufficient IP addresses in the node pool's subnet. [Reference]
PodDisruptionBudgets (PDBs) did not allow for at least 1 pod replica to be moved at a time causing the drain/evict operation to fail. [Reference]
In general,
The Eviction API can respond in one of three ways:
If the eviction is granted, then the Pod is deleted as if you sent a DELETE request to the Pod's URL and received back 200 OK.
If the current state of affairs wouldn't allow an eviction by the rules set forth in the budget, you get back 429 Too Many Requests. This is typically used for generic rate limiting of any requests, but here we mean that this request isn't allowed right now but it may be allowed later.
If there is some kind of misconfiguration; for example multiple PodDisruptionBudgets that refer the same Pod, you get a 500 Internal Server Error response.
For a given eviction request, there are two cases:
There is no budget that matches this pod. In this case, the server always returns 200 OK.
There is at least one budget. In this case, any of the three above responses may apply.
Stuck evictions
In some cases, an application may reach a broken state, one where unless you intervene the eviction API will never return anything other than 429 or 500.
For example: this can happen if ReplicaSet is creating Pods for your application but the replacement Pods do not become Ready. You can also see similar symptoms if the last Pod evicted has a very long termination grace period.
How to investigate further?
On the Azure Portal navigate to your AKS cluster
Go to Resource Health on the left hand menu as shown below and click on Diagnose and solve problems
You should see something like the following
If you click on each of the options, you should see a number of checks loading. You can set the time frame of impact on the top right hand corner of the screen as shown below (Please press the Enter key after you have set the correct timeframe). You can click on the More Info link on the right hand side of each entry for detailed information and recommended action.
How to mitigate the issue?
Once you have identified the issue and followed the recommendations to fix the same, please perform an az aks upgrade on the AKS cluster to the same Kubernetes version it is currently running. This should initiate a reconcile operation wherever required under the hood.

'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure Kubernetes server? (AKS)

My question (to MS and anyone else) is: Why is this issue occurring and what work around can be implemented by the users / customers themselves as opposed to by Microsoft Support?
There have obviously been 'a few' other question about this issue:
Managed Azure Kubernetes connection error
Can't contact our Azure-AKS kube - TLS handshake timeout
Azure Kubernetes: TLS handshake timeout (this one has some Microsoft feedback)
And multiple GitHub issues posted to the AKS repo:
https://github.com/Azure/AKS/issues/112
https://github.com/Azure/AKS/issues/124
https://github.com/Azure/AKS/issues/164
https://github.com/Azure/AKS/issues/177
https://github.com/Azure/AKS/issues/324
Plus a few twitter threads:
https://twitter.com/ternel/status/955871839305261057
TL;DR
Skip to workarounds in Answers below.
Current best solution is post a help ticket — and wait — or re-create your AKS cluster (maybe more than once, cross your fingers, see below...) but there should be something better. At least please grant the ability to let AKS preview customers, regardless of support tier, upgrade their support request severity for THIS specific issue.
You can also try scaling your Cluster (assuming that doesn't break your app).
What about GitHub?
Many of the above GitHub issues have been closed as resolved but the issue persists. Previously there was an announcements document regarding the problem but no such status updates are currently available even though the problem continues to present itself:
https://github.com/Azure/AKS/tree/master/annoucements
I am posting this as I have a few new tidbits that I haven't seen elsewhere and I am wondering if anyone has ideas as far as other potential options for working around the issue.
Affected VM / Node Resource Usage
The first piece I haven't seen mentioned elsewhere is Resource usage on the nodes / vms / instances that are being impacted by the above Kubectl 'Unable to connect to the server: net/http: TLS handshake timeout' issue.
Production Node Utilization
The node(s) on my impacted cluster look like this:
The drop in utilization and network io correlates strongly with both the increase in disk utilization AND the time period we began experiencing the issue.
The overall Node / VM utilization is generally flat prior to this chart for the previous 30 days with a few bumps relating to production site traffic / update pushes etc.
Metrics After Issue Mitigation (Added Postmortem)
To the above point, here are the metrics the same Node after Scaling up and then back down (which happened to alleviate our issue, but does not always work — see answers at bottom):
Notice the 'Dip' in CPU and Network? That's where the Net/http: TLS issue impacted us — and when the AKS Server was un-reachable from Kubectl. Seems like it wasn't talking to the VM / Node in addition to not responding to our requests.
As soon as we were back (scaled the # nodes up by one, and back down — see answers for workaround) the Metrics (CPU etc) went back to normal — and we could connect from Kubectl. This means we can probably create an Alarm off of this behavior (and I have a issue in asking about this on Azure DevOps side: https://github.com/Azure/AKS/issues/416)
Node Size Potentially Impacts Issue Frequency
Zimmergren over on GitHub indicates that he has less issues with larger instances than he did running bare bones smaller nodes. This makes sense to me and could indicate that the way the AKS servers divy up the workload (see next section) could be based on the size of the instances.
"The size of the nodes (e.g. D2, A4, etc) :)
I've experienced that when running A4 and up, my cluster is healther than if running A2, for example. (And I've got more than a dozen similar experiences with size combinations and cluster failures, unfortunately)." (https://github.com/Azure/AKS/issues/268#issuecomment-375715435)
Other Cluster size impact references:
giorgited (https://github.com/Azure/AKS/issues/268#issuecomment-376390692)
An AKS server responsible for more smaller Clusters may possibly get hit more often?
Existence of Multiple AKS Management 'Servers' in one Az Region
The next thing I haven't seen mentioned elsewhere is the fact that you can have multiple Clusters running side by side in the same Region where one Cluster (production for us in this case) gets hit with 'net/http: TLS handshake timeout' and the other is working fine and can be connected to normally via Kubectl (for us this is our identical staging environment).
The fact that users (Zimmergren etc above) seem to feel that the Node size impacts the likelihood that this issue will impact you also seems to indicate that node size may relate to the way the sub-region responsibilities are assigned to the sub-regional AKS management servers.
That could mean that re-creating your cluster with a different Cluster size would be more likely to place you on a different management server — alleviating the issue and reducing the likelihood that multiple re-creations would be necessary.
Staging Cluster Utilization
Both of our AKS Clusters are in U.S. East. As a reference to the above 'Production' Cluster metrics our 'Staging' Cluster (also U.S. East) resource utilization does not have the massive drop in CPU / Network IO — AND does not have the increase in disk etc. over the same period:
Identical Environments are Impacted Differently
Both of our Clusters are running identical ingresses, services, pods, containers so it is also unlikely that anything a user is doing causes this problem to crop up.
Re-creation is only SOMETIMES successful
The above existence of multiple AKS management server sub-regional responsibilities makes sense with the behavior described by other users on github (https://github.com/Azure/AKS/issues/112) where some users are able to re-create a cluster (which can then be contacted) while others re-create and still have issues.
Emergency could = Multiple Re-Creations
In an emergency (ie your production site... like ours... needs to be managed) you can PROBABLY just re-create until you get a working cluster that happens to land on a different AKS management server instance (one that is not impacted) but be aware that this may not happen on your first attempt — AKS cluster re-creation is not exactly instant.
That said...
Resources on the Impacted Nodes Continue to Function
All of the containers / ingresses / resources on our impacted VM appear to be working well and I don't have any alarms going off for up-time / resource monitoring (other than the utilization weirdness listed above in the graphs)
I want to know why this issue is occurring and what work around can be implemented by the users themselves as opposed to by Microsoft Support (currently have a ticket in). If you have an idea let me know.
Potential Hints at the Cause
https://github.com/Azure/AKS/issues/164#issuecomment-363613110
https://github.com/Azure/AKS/issues/164#issuecomment-365389154
Why no GKE?
I understand that Azure AKS is in preview and that a lot of people have moved to GKE because of this problem (). That said my Azure experience has been nothing but positive thus far and I would prefer to contribute a solution if at all possible.
And also... GKE occasionally faces something similar:
TLS handshake timeout with kubernetes in GKE
I would be interested to see if scaling the nodes on GKE also solved the problem over there.
Workaround 1 (May Not Work for Everyone)
An interesting solution (worked for me) to test is scaling the number of nodes in your cluster up, and then back down...
Log into the Azure Console — Kubernetes Service blade.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Alternately you can (maybe) do this from the command line:
az aks scale --name <name-of-cluster> --node-count <new-number-of-nodes> --resource-group <name-of-cluster-resource-group>
Since it is a finicky issue and I used the web interface I am uncertain if the above is identical or would work.
Total time it took me ~2 minutes — for my situation that is MUCH better than re-creating / configuring a Cluster (potentially multiple times...)
That being Said....
Zimmergren brings up some good points that Scaling is not a true Solution:
"It worked sometimes, where the cluster self-healed a period after scaling. It failed sometimes with the same errors. I don't consider scaling a solution to this problem, as that causes other challenges depending on how things are set up. I wouldn't trust that routine for a GA workload, that's for sure. In the current preview, it's a bit wild west (and expected), and I'm happy to blow up the cluster and create a new one when this fails continuously." (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
Azure Support Feedback
Since I had a support ticket open at the time I ran into the above scaling solution I was able to get feedback (or rather a guess) on what the above might have worked, here's a paraphrased response:
"I know that scaling the cluster can sometimes help if you get into a state where the number of nodes is mismatched between “az aks show” and “kubectl get nodes”. This may be similar."
Workaround References:
GitHub user Scaled nodes from console and fixed the problem: https://github.com/Azure/AKS/issues/268#issuecomment-375722317
Workaround Didn't Work?
If this DOES NOT work for you, please post a comment below as I am going to try to keep an up to date list of how often the issue crops up, whether it resolves itself, and whether this solution works across Azure AKS users (looks like it doesn't work for everyone).
Users Scaling Up / Down DID NOT work for:
omgsarge (https://github.com/Azure/AKS/issues/112#issuecomment-395231681)
Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
sercand — scale operation itself failed — not sure if it would have impacted connectability (https://github.com/Azure/AKS/issues/268#issuecomment-395301296)
Scaling Up / Down DID work for:
Me
LohithChanda (https://github.com/Azure/AKS/issues/268#issuecomment-395207716)
Zimmergren (https://github.com/Azure/AKS/issues/268#issuecomment-395299308)
Email Azure AKS Specific Support
If after all the diagnosis you still suffer from this issue, please don't hesitate to send email to aks-help#service.microsoft.com
Adding another answer since this is now the Azure Support official solution when the above attempts do not work. I haven't experienced the issue in a while so I can't verify this one but it seems like it would make sense to me (based on previous experience).
Credit on this one / full thread found here (https://github.com/Azure/AKS/issues/14#issuecomment-424828690)
Check for Tunneling Issues
ssh to the agent node which running the tunnelfront pod
get tunnelfront logs: "docker ps" -> "docker logs "
"nslookup " whose fqdn can be get from above command -> if it resolves ip, which means dns works, then go to the following step
"ssh -vv azureuser# -p 9000" ->if port is working, go to the next step
"docker exec -it /bin/bash", type "ping google.com", if it is no response, which means tunnel front pod doesn't have external network, then do following step
restart kube-proxy, using "kubectl delete po -n kube-system", choose the kube-proxy which is runing on the same node with tunnelfront. customer can use "kubectl get po -n kube-system -o wide"
I feel like this particular work-around could PROBABLY be automated (for sure on Azure side but probably on the community side).
Email Azure AKS Specific Support
If after all the diagnosis you still suffer from this issue, please don't hesitate to send email to aks-help#service.microsoft.com
Workaround 2 Re-Create Cluster (Somewhat Obvious)
I am adding this one because there are some details to keep in mind and even though I touched on it in my original Question, that thing got long, so I am adding specific details on re-creation here.
Cluster Re-Creation Doesn't Always Work
Per the above in my original question there are multiple AKS Server instances that divide up responsibilities for a given Azure region (we think). Some, or all, of these can be impacted by this bug resulting in your Cluster being un-reachable via Kubectl.
That means that if you re-create your Cluster and it some how lands on the same AKS server, probably that new Cluster will ALSO not be reachable requiring...
Additional Re-creation Attempts
Probably re-creating multiple times will result in you eventually landing your new Cluster on one of the other AKS servers (which is working fine).
As far as I can tell I don't see any indication that ALL AKS servers get hit with this problem at once in a while (if ever).
Different Cluster Node Size
If you are in a pinch and want the highest possibly probability (we haven't confirmed this) that your re-creation lands on a different AKS management server — choose a different Node size when you create your new Cluster (see Node Size section of the initial Question above).
I have opened this ticket asking Azure DevOps whether or not the Node Size is ACTUALLY related to deciding which Clusters are administered by which AKS management servers: https://github.com/Azure/AKS/issues/416
Support Ticket Fix vs. Self Healing
Since there are a lot of users who indicate that the problem occasionally solves itself and just goes away I think that it is reasonable to guess that Support actually fixes the offending AKS server (which may result in other users having their Clusters fixed — 'Self Heal') as opposed to fixing the individual user's Cluster.
Creating Support Tickets
To me the above would likely mean that creating a Ticket is probably a good thing since it would fix other user Clusters experiencing the same issue — it might also be an argument for allowing support issue severity escalation for this specific issue.
I think this is also a decent indicator that maybe Azure support hasn't figured out how to fully alarm for the problem yet, in which case creation of a support ticket serves that purpose as well.
I also asked Azure DevOps whether they Alarm for the issue (based on my experience easily visualizing the issue based on CPU and Network IO metric changes) on their side: https://github.com/Azure/AKS/issues/416
If NOT (haven't heard back) then it makes sense to create a ticket EVEN IF you plan to re-create your cluster since that ticket would then make Azure DevOps aware of the issue resulting in a fix for other users on that Cluster management server.
Things to make Cluster Re-Creation Easier
I will add to this (feedback / ideas are appreciated) but off the top of my head:
Be diligent (obvious) about how you store all YAML files used to create your Cluster (even if you don't re-deploy often for your app by design).
Script your DNS modifications in order to speed up pointing to the new instance — If you have a public facing app / service that utilizes DNS (Maybe something like this example for Google Domains?: https://gist.github.com/cyrusboadway/5a7b715665f33c237996, Full docs here: https://cloud.google.com/dns/api/v1/)
We just had this issue for one of our clusters. Sent a support ticket and got called back 5 minutes later by an engineer asking if it was OK for them to restart the API Server. 2 minutes later it was working again.
Reason was something about timeouts in their messaging queue.

Make Node/MEANjs Highly Available

I'm probably opening up a can of worms with regard to how many hundreds of directions can be taken with this- but I want high availability / disaster recovery with my MEANjs servers.
Right now, I have 3 servers:
MongoDB
App (Grunt'ing the main application, this is the front end
server)
A third server for other processing on the back-end
So at the moment, if I reboot my MongoDB server (or more realistically, it crashes for some reason), I suddenly see this in my App server terminal:
MongoDB connection error: Error: failed to connect to
[172.30.3.30:27017] [nodemon] app crashed - waiting for file changes
before starting...
After MongoDB is back online, nothing happens on the app server until I re-grunt.
What's the best practice for this situation? You can see in the error I'm using nodeMon to monitor changes to the app. I bet upon init I could get my MongoDB server to update a file on the app server within nodemon's view to force a restart? Or is there some other tool I can use for this? Or should I be handling my connections to the db server more gracefully so the app doesn't "crash"?
Is there a way to re-direct to a secondary mongodb in case the primary isn't available? This would be more apt to HA/DR type stuff.
I would like to start with a side note: Given the description in the question and the comments to it, I am not convinced that using AWS is a wise option. A PaaS provider like Heroku, OpenShift or AppFog seems to be more suitable, especially when combined with a MongoDB service provider. Running MongoDB on EBS can be quite a challenge when you are new to MongoDB. And pretty expensive, too, as soon as you need provisioned IOPS.
Note In the following paragraphs, I simplified a few things for the sake of comprehensibility
If you insist on running it on your own, however, you have an option. MongoDB itself comes with means of automatic, transparent failover, called a replica set.
A minimal replica set consists of of two data bearing nodes and a so called arbiter. Write operations go to the node currently elected "primary" only, and reads do, too, unless you explicitly allow or request reads to be performed on the current "secondary". The secondary constantly syncs to the primary. If the current primary goes down for some reason, the former secondary becomes elected primary.
The arbiter is there so that there is always a quorum (qualified majority would be an equivalent term) of members to elect the current secondary to be the new primary. This quorum is mainly important for edge cases, but since you can not rule out these edge cases, an uneven number of members is a hard requirement for a MongoDB replica set (setting aside some special cases).
The beauty of this is that almost all drivers, and the node.js for sure, are replica set aware and deal with the failover procedure pretty gracefully. They simply send the reads and writes to the new primary, without any change to be done at any other point.
You only need to deal with some cases during the failover process. Without going into much detail, you basically check for certain errors in the according callbacks and redo the operation, if you encounter one of those errors and redoing the operation is feasible.
As you might have noticed, the third member, the arbiter, does not hold much data. It is a very lightweight process and can basically run on the cheapest instance you can find.
So you have data replication and automatic, transparent failover with relative ease at the cost of the cheapest VM you can find, since you would need two data bearing nodes anyway if you used any other means.

How do I cause Azure move my instance if I decide the VM is faulty?

Suppose I have the following situation. One of my Azure role instances happens to be started on a VM that runs inside a faulty server but Azure wiring processes don't see any problems. I somehow deduce this fact - for example I see an "impossible" call stack - one that can't happen in my program under any normal conditions.
So I'd like Azure to move my instance to another VM and have the underlying hardware checked and repaired.
How can I do that except contacting support?
A few comments:
You can have this done, sortof, by calling support. The support team won't move your VM to a new server just because you ask, but they will work with you to determine if the physical server really is bad, and if so move it to out of service.
RequestRecycle will only shut down the host process (ie. WaIISHost) and related processes and then restart them. It won't reboot the VM, clean boot, or redeploy.
You can try a 'Reimage' from the portal or Powershell if you suspect you might have a corrupt Windows installation. A Reimage will recreate the Windows partition from scratch.
In order to force a new VM to be on a new server you would have to do an in-place upgrade and modify the size of the VM (ie. go from Small to Medium). This will cause new VMs to be created on new servers. You can then do another in-place upgrade to revert back to the original size.
That being said, I strongly agree with Brian's comment that it is very unlikely that bad hardware is causing an 'impossible' callstack. I would recommend opening a support incident so you can find the actual root cause instead of just fixing the most visible symptom.
I don't think you can move a VM. But you could create a new staging deployment, swap it into production, and then destroy the old one. You can't actually guarantee that the VMs are on different physical machines, but it seems reasonably likely. The larger the VMs are, the more likely that they're on separate servers.
That said, it seems really unlikely that your problems are due to a hardware fault rather than some subtle bug.

Azure Development - How to stop a Web Role instance

I need to test how my code will handle the failure of a web role instance in a development environment.
How do I terminate one of the instances? I can't see any option in the UI for this. Seems like a strange ommission
Update
The issue is relating to a distributed cache layer (I know that azure offers their own)
I want to be able to test how the system reacts to a missing or additional node etc
Prehaps my real question is
how up to date is RoleEnvironment.CurrentRoleInstance.Role.Instances
The need to simulate ungraceful exits in the dev emulator usually is done because you are doing something in your web role that is stateful or long running. That is generally discouraged, but sometimes is unavoidable.
I suspect the best way to simulate the a failure is to kill processes. If you open task manager (or better Process Explorer), you will see "WatDebugger" hosting either "WaIISHost" or "WaWorkerHost". If you kill this process, I think it will simulate a failure.
Honestly, it is easier to test this one in the cloud however. You can RDP into one of the instances and kill the 'WaAppAgent' process. That will kill your RoleEntryPoint and fabric controller agent. That will be a true ungraceful failure.
By failure, do you mean becoming unavailable? It should be seamless because the next request would simply be handled by one of the other instances. As long as there is one instance available Azure will route calls to that instance.
This is the nature of a high-available system, requests are handled by the available instances. This is why you have multiple instances in the first place, to handle requests in the case of failure in one or more instances.
This is why you need to always be watchful of how your application handles state. State needs to be maintained outside of the instance, either in queues or in a database. This ensures that any process can pickup a piece of work and execute against it.
There is another question dealing with Session State that should help: How does Microsoft Azure handle Session State?
By terminate an instance do you mean reducing instance count and see which one gets killed? I like Ryan's view about ungraceful exits, but if it's forced kill by the fabric it'll be a different ball game.

Resources