I am currently testing how Azure Kubernetes handles failover for StatefulSets. I simulated a network partition by running sudo iptables -A INPUT -j DROP on one of my nodes. It's not perfect, but it's good enough to test some things.
1). How can I reuse disks that are mounted to a failed node? Is there a way to manually release the disk and make it available to the rescheduled pod? After a force delete it takes a very long time for the resources to be released, sometimes over an hour.
2). If I delete a node from the cluster all the resources are released after a certain amount of time. The problem is that in the Azure dashboard it still displays my cluster as using 3 nodes even if I have deleted one. Is there a way to manually add the deleted node back in or do I need to rebuild the cluster each time?
3). I most definitely do not want to use ReadWriteMany.
Basically what I want is for my StatefulSet pods to terminate and have the associated disks detach and then reschedule on a new node in the event of a network partition or a node failure. I know the pods will terminate in the event of a recovery from a network partition but I want control over the process myself or at least have it happen sooner.
1). Yes, just detach the disks manually from the portal (or PowerShell, CLI, API, etc.).
2). This is not supported; you should not do this. Scaling or upgrading might fix this, but it might not.
3). Okay, don't.
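For the CLI route to detaching a disk, a minimal sketch in Python that only assembles the Azure CLI call might look like this. The resource group, VM, and disk names are hypothetical placeholders, and it assumes the failed node is a standalone VM; nodes backed by a scale set would need a different command.

```python
import shlex

def detach_disk_cmd(resource_group: str, vm_name: str, disk_name: str) -> list[str]:
    """Build the `az vm disk detach` call for one data disk (nothing is executed)."""
    return [
        "az", "vm", "disk", "detach",
        "--resource-group", resource_group,
        "--vm-name", vm_name,
        "--name", disk_name,
    ]

# Hypothetical names, for illustration only.
cmd = detach_disk_cmd("my-resource-group", "k8s-agent-1", "pvc-disk-0")
print(shlex.join(cmd))
```

Printing the command rather than running it leaves room to double-check which disk and VM are targeted before anything is detached.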
Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one Nodepool with 3 nodes which are fairly low on resources because we still don't have that many active users/requests simultaneously.
Our backend APIs app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory, or even replacing the VMs of the nodes with better ones).
We structured everything Kubernetes-related using Terraform, and I know that replacing VMs in a node is a destructive action, meaning the cluster will have to be replaced and the new config, deployments, services, etc. will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running, but I would like to learn the best practices for scaling and performance. How can I perform such an increase in resources without any downtime for our services?
I'm wondering if having an extra Nodepool would help while I replace the VMs of the other one (I might be absolutely wrong here).
If there's any link, course, or tutorial you can point me to, it's highly appreciated.
(Moved from comments)
In Azure, when you're performing a cluster upgrade, there's a parameter called "max surge count" which is equal to 1 by default. This means that when you update your cluster or node configuration, it first creates one extra node with the updated configuration, and only then safely drains and removes one of the old ones. More on this here: Azure - Node Surge Upgrade
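To make the surge behaviour concrete, here is a small Python simulation of the sequence described above. It is a simplified model of "add a new node first, then remove an old one", not Azure's actual implementation; the function name and the step-by-step bookkeeping are my own assumptions.

```python
def surge_upgrade(node_count: int, max_surge: int = 1) -> list[tuple[int, int]]:
    """Simulate a surge upgrade; return (old, new) node counts after each step."""
    old, new = node_count, 0
    history = [(old, new)]
    while old > 0:
        batch = min(max_surge, old)
        new += batch   # first create surge nodes with the updated configuration
        history.append((old, new))
        old -= batch   # then drain and remove the same number of old nodes
        history.append((old, new))
    return history

steps = surge_upgrade(3, max_surge=1)
for old, new in steps:
    print(f"old={old} new={new} total={old + new}")
```

In this model the total never drops below the original three nodes, which is why pods drained from an old node have somewhere to go during the upgrade.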
I have a k8s cluster on Azure created with acs-engine. It has 4 Windows agent nodes.
Recently 2 of the nodes went into a not-ready state and remained there for over a day. In an attempt to correct the situation I did a "kubectl delete node" command on both of the not-ready nodes, thinking that they would simply be restarted in the same way that a pod that is part of a deployment is restarted.
No such luck. The nodes no longer appear in the "kubectl get nodes" list. The virtual machines that are backing the nodes are still there and still running. I tried restarting the VMs thinking that this might cause them to self register, but no luck.
How do I get the nodes back as part of the k8s cluster? Otherwise, how do I recover from this situation? Worst case, I can simply throw away the entire cluster and recreate it, but I would really like to simply fix what I have.
You can delete the virtual machines and rerun your acs-engine template; that should bring the nodes back (although I didn't really test your exact scenario). Or you could simply create a new cluster; it doesn't take a lot of time, since you just need to run your template.
There is no way of recovering from the deletion of an object in k8s. I'm pretty sure they are purged from etcd as soon as you delete them.
Since my Cassandra cluster is replicated across three availability zones, I would like to back up only one availability zone to lower the backup costs. I have also experimented with restoring nodes in a single availability zone and got back most of my data in a test environment. I would like to know if there are any drawbacks to this approach before deploying this solution in production. Is anyone following this approach in their production clusters?
Note: As I back up at regular intervals, I know that I may lose updates that had reached quorum only on nodes in the other two AZs at the time of the snapshot, but that's not a problem.
You can back up only a specific DC, or even specific nodes.
AFAIK, the only drawback is whether your data is consistent and up to date, and since you can afford to lose some data, it shouldn't be a problem. And if you are, for example, performing writes with the ALL consistency level, the data should be up to date on all nodes.
BUT you must be sure that your data is indeed replicated across multiple AZs, by configuring the rack/DC properties or using an EC2 snitch that supports multiple AZs.
EDIT:
Global Snapshot
Running nodetool snapshot is only run on a single node at a time.
This only creates a partial backup of your entire data. You will want
to run nodetool snapshot on all of the nodes in your cluster. But
it’s best to run them at the exact same time, so that you don’t have
fragmented data from a time perspective. You can do this a couple of
different ways. The first, is to use a parallel ssh program to
execute the nodetool snapshot command at the same time. The second,
is to create a cron job on each of the nodes to run at the same time.
The second assumes that your nodes have clocks that are in sync, which
Cassandra relies on as well.
Link to the page:
http://datascale.io/backing-up-cassandra-data/
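As a rough illustration of the "parallel ssh" option from the quote, the Python sketch below builds one nodetool snapshot command per node with a shared tag and can run them concurrently. The hostnames and the tag format are hypothetical, and it assumes passwordless ssh to each node with nodetool on the remote PATH; only the command lines are printed here, nothing is executed unless you call snapshot_all yourself.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

def snapshot_command(host: str, tag: str) -> list[str]:
    """Build the ssh command that runs `nodetool snapshot -t <tag>` on one node."""
    return ["ssh", host, "nodetool", "snapshot", "-t", tag]

def snapshot_all(hosts: list[str], tag: str) -> dict[str, int]:
    """Run the snapshot on every host at roughly the same time; return exit codes."""
    def run(host: str) -> int:
        return subprocess.run(snapshot_command(host, tag)).returncode
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(zip(hosts, pool.map(run, hosts)))

# Hypothetical hostnames, e.g. the nodes of the single AZ being backed up.
hosts = ["cass-az1-a", "cass-az1-b"]
tag = datetime.now(timezone.utc).strftime("backup-%Y%m%dT%H%M%SZ")
for host in hosts:
    print(" ".join(snapshot_command(host, tag)))
```

Using the same tag on every node makes it easier to find the matching snapshot directories later when restoring.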
I'm experimenting with Cassandra and Redis on Kubernetes, using the examples for v1.5.1.
With a Cassandra StatefulSet, if I shut down a node without draining or deleting it via kubectl, that node's Pod stays around forever (at least over a week, anyway), without being moved to another node.
With Redis, even though the pod sticks around like with Cassandra, the sentinel service starts a new pod, so the number of functional pods is always maintained.
Is there a way to automatically move the Cassandra pod to another node, if a node goes down? Or do I have to drain or delete the node manually?
Please refer to the documentation here.
Kubernetes (versions 1.5 or newer) will not delete Pods just because a
Node is unreachable. The Pods running on an unreachable Node enter the
‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter
these states when the user attempts graceful deletion of a Pod on an
unreachable Node. The only ways in which a Pod in such a state can be
removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding,
kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
This was a behavioral change introduced in kubernetes 1.5, which allows StatefulSet to prioritize safety.
There is no way to differentiate between the following cases:
The instance being shut down without the Node object being deleted.
A network partition is introduced between the Node in question and the kubernetes-master.
Both these cases are seen as the kubelet on a Node being unresponsive by the Kubernetes master. If in the second case, we were to quickly create a replacement pod on a different Node, we may violate the at-most-one semantics guaranteed by StatefulSet, and have multiple pods with the same identity running on different nodes. At worst, this could even lead to split brain and data loss when running Stateful applications.
On most cloud providers, when an instance is deleted, Kubernetes can figure out that the Node is also deleted, and hence let the StatefulSet pod be recreated elsewhere.
However, if you're running on-prem, this may not happen. It is recommended that you delete the Node object from Kubernetes as you power it down, or have a reconciliation loop keeping the Kubernetes idea of Nodes in sync with the actual nodes available.
Some more context is in the github issue.
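If you have confirmed yourself that the node is really gone (and not just partitioned), the two manual options from the quoted documentation, deleting the Node object or force deleting the Pod, could be scripted roughly as follows. This is a sketch that only builds the kubectl command lines; the pod and node names are hypothetical, and running the force delete while the old pod is still alive risks exactly the split brain described above.

```python
def force_delete_pod_cmd(pod: str, namespace: str = "default") -> list[str]:
    """Build a kubectl call that force deletes a Pod stuck on an unreachable Node."""
    return [
        "kubectl", "delete", "pod", pod,
        "--namespace", namespace,
        "--grace-period=0", "--force",
    ]

def delete_node_cmd(node: str) -> list[str]:
    """Build a kubectl call that removes the Node object from the cluster."""
    return ["kubectl", "delete", "node", node]

# Hypothetical names, for illustration only.
print(" ".join(force_delete_pod_cmd("cassandra-1")))
print(" ".join(delete_node_cmd("node-3")))
```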
Is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs? The cluster management instructions at this link: https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/clusters/
only show how to destroy a cluster, but I have installed the Spark Cassandra connector, for example. Is my only alternative to create an image that I'll need to install every time?
In general, the best thing to do is to distill out the steps you used to customize your cluster into some setup scripts, and then use Dataproc's initialization actions to easily automate doing the installation during cluster deployment.
This way, you can easily reproduce the customizations without requiring manual involvement if you ever want, for example, to do the same setup on multiple concurrent Dataproc clusters, or want to change machine types, or receive sub-minor-version bug fixes that Dataproc releases occasionally.
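As a sketch of what that looks like in practice, the Python helper below assembles a gcloud cluster-creation command that points at setup scripts through the --initialization-actions flag. The cluster name, region, and bucket path are hypothetical, and the script itself (for example, one that installs the Spark Cassandra connector) is assumed to have been uploaded to Cloud Storage already.

```python
def create_cluster_cmd(name: str, region: str, init_scripts: list[str]) -> list[str]:
    """Build a `gcloud dataproc clusters create` call with initialization actions."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        "--region", region,
        # The flag takes a comma-separated list of Cloud Storage URIs.
        "--initialization-actions", ",".join(init_scripts),
    ]

# Hypothetical cluster name, region, and script location.
cmd = create_cluster_cmd(
    "spark-cassandra", "us-central1",
    ["gs://my-bucket/install-cassandra-connector.sh"],
)
print(" ".join(cmd))
```

Because the customization lives in the script, deleting the cluster when idle and recreating it later costs one command rather than a manual reinstall.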
There's indeed no officially supported way of pausing a Dataproc cluster at the moment, in large part simply because being able to have reproducible cluster deployments along with several other considerations listed below means that 99% of the time it's better to use initialization-action customizations instead of pausing a cluster in-place. That said, there are possible short-term hacks, such as going into the Google Compute Engine page, selecting the instances that are part of the Dataproc cluster you want to pause, and clicking "stop" without deleting them.
The Compute Engine hourly charges and Dataproc's per-vCPU charges are only incurred when the underlying instance is running, so while you've "stopped" the instances manually, you won't incur Dataproc or Compute Engine's instance-hour charges despite Dataproc still listing the cluster as "RUNNING", albeit with warnings that you'll see if you go to the "VM Instances" tab of the Dataproc cluster summary page.
You should then be able to just click "start" from the Google Compute Engine page to have the cluster running again, but it's important to consider the following caveats:
The cluster may occasionally fail to start up into a healthy state again; anything using local SSDs already can't be stopped and started again cleanly, but beyond that, Hadoop daemons may have failed for whatever reason to flush something important to disk if the shutdown wasn't orderly, or even user-installed settings may have broken the startup process in unknown ways.
Even when VMs are "stopped", they depend on the underlying Persistent Disks remaining, so you'll continue to incur charges for those even while "paused"; if we assume $0.04 per GB-month, and a default 500GB disk per Dataproc node, that comes out to continuing to pay ~$0.028/hour per instance; generally your data will be more accessible and also cheaper to just put in Google Cloud Storage for long term storage rather than trying to keep it long-term on the Dataproc cluster's HDFS.
If you come to depend on a manual cluster setup too much, then it'll become much more difficult to re-do if you need to size up your cluster, or change machine types, or change zones, etc. In contrast, with Dataproc's initialization actions, you can use Dataproc's cluster scaling feature to resize your cluster and automatically run the initialization actions for new workers created.
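The persistent-disk figure in the caveats above can be checked with a few lines of Python. The price and disk size are the ones assumed in the text; the 30-day month used to convert to an hourly rate is my own assumption.

```python
PRICE_PER_GB_MONTH = 0.04   # USD, the rate assumed in the text
DISK_GB_PER_NODE = 500      # default disk size assumed in the text
HOURS_PER_MONTH = 30 * 24   # assumption: a 30-day month

monthly = PRICE_PER_GB_MONTH * DISK_GB_PER_NODE
hourly = monthly / HOURS_PER_MONTH
print(f"${monthly:.2f}/month per node, about ${hourly:.3f}/hour")
```

So each "paused" node still costs roughly $20 a month in disk charges alone, which adds up quickly on larger clusters.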
Update
Dataproc recently launched the ability to stop and start clusters: https://cloud.google.com/dataproc/docs/guides/dataproc-start-stop
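With that feature, pausing and resuming could be scripted along these lines. This is a minimal sketch that only builds the gcloud command lines; the cluster name and region are hypothetical, and the linked guide remains the authority on restrictions such as which cluster configurations can be stopped.

```python
def cluster_power_cmd(action: str, name: str, region: str) -> list[str]:
    """Build a `gcloud dataproc clusters stop|start` call for one cluster."""
    if action not in ("stop", "start"):
        raise ValueError(f"unsupported action: {action}")
    return ["gcloud", "dataproc", "clusters", action, name, "--region", region]

# Hypothetical cluster name and region.
print(" ".join(cluster_power_cmd("stop", "spark-cassandra", "us-central1")))
print(" ".join(cluster_power_cmd("start", "spark-cassandra", "us-central1")))
```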