Does Kubernetes reschedule pods to other minions if the minions join latter? - node.js

I was running a test with initially a minion node and a master node. I created 5 pods on the cluster and later on 2 minion nodes joined the cluster.
So the problem I faced was that all the pods were only scheduled on the master and minion nodes. They were not re-scheduled to new nodes so as to divide the whole resources. Due to which my new minion nodes were just sitting idle and didn't do any processing.
Is there anything specially to be run to make this happen ?

Not really. The scheduler is called whenever something needs to be scheduled, so unless you deploy new replicas of the pod, the scheduler won't be bothered again.
Whenever you want to schedule something, like creating a Deployment or a Pod, the scheduler looks at the available resources to place the Pods where it thinks is best. Next time you schedule something, it will take into account the new minions added to the cluster. Or if your pods are created via a Deployment object, you could try deleting one Pod, so the ReplicationController will create a new Pod and the scheduler may choose one of the new minions.
The documentation also recommends creating a Service before creating a Deployment`, so the scheduler will spread the pods better among the minions.

Related

Kubernetes Jobs or Pods for completion Jobs with auto scaling

I have CPU Intensive Jobs/tasks,
Need to run them in kubernetes, below is the process of job/task
We get request in terms queue or API Call
POd should be created and process the task ( few Jobs may run in minutes, few in hours)
delete pod once task completed
This should happen in scale, if more jobs in queue, create more jobs (Max 10, 20, 30 2e should define it)
I am used KEDA, POD will be created and after Job completion it is going crashloopbback, It is default behaviour in POD life cycle, because it try to recreate pod since restart policy is set to Always. We have other options like OnFailure, Never, But I read it Kubernetes Jobs are more suitable
Which is the better option Kubernetes Pods or Jobs for above task, we should consider scaling POds and also required scale kubernetes nodes (Cloud vendors supports it) based on usage and numbers of tasks in queue.
KEDA ScaledJobs are best for such scenarios and can be triggered through Queue, Storage, etc. (the currently available scalers can be found here)

Azure Kubernetes Failover with PersistentVolumes

I am currently testing how Azure Kubernetes handles failover for StatefulSets. I simulated a network partition by running sudo iptables -A INPUT -j DROP on one of my nodes, not perfect but good enough to test some things.
1). How can I reuse disks that are mounted to a failed node? Is there a way to manually release the disk and make it available to the rescheduled pod? It takes forever for the resources to be released after doing a force delete, sometimes this takes over an hour.
2). If I delete a node from the cluster all the resources are released after a certain amount of time. The problem is that in the Azure dashboard it still displays my cluster as using 3 nodes even if I have deleted one. Is there a way to manually add the deleted node back in or do I need to rebuild the cluster each time?
3). I most definitely do not want to use ReadWriteMany.
Basically what I want is for my StatefulSet pods to terminate and have the associated disks detach and then reschedule on a new node in the event of a network partition or a node failure. I know the pods will terminate in the event of a recovery from a network partition but I want control over the process myself or at least have it happen sooner.
Yes, just detach the disks manually from the portal (or powershell\cli\api\etc)
This is not supported, you should not do this. Scaling\Upgrading might fix this, but it might not
Okay, dont.

Recovering from Kubernetes node failure running Cassandra

I'm looking for a good solution to replace dead Kubernetes worker node that was running Cassandra in Kubernetes.
Scenario:
Cassandra cluster built from 3 pods
Failure occurs on one of the Kubernetes worker nodes
Replacement node is joining the cluster
New pod from StatefulSet is scheduled on new node
As pod IP address has changed, new pod is visible as new Cassandra node (4 nodes in cluster in total) and is unable to bootstrap until
the dead one is removed.
It's very difficult to follow the official procedure, as Cassandra is running as StatefulSet.
One completely hacky workaround I've found is to use ConfigMap to supply JAVA_OPTS. As changing ConfigMap doesn't recreate pods (yet), you can manipulate running pods in such way that you will be able to follow the procedure.
However, that's, as I mentioned, super hacky. I'm wondering if anyone is running Cassandra on top of Kubernetes and has a better idea how to deal with such failure?
Jetstack navigator supports this, but it's currently in alpha:
https://github.com/jetstack/navigator
unable to bootstrap until the dead one is removed.
Why is that?
I use the statefulset and I'm able to kill a pod and have a new one join in

StatefulSet: pods stuck in unknown state

I'm experimenting with Cassandra and Redis on Kubernetes, using the examples for v1.5.1.
With a Cassandra StatefulSet, if I shutdown a node without draining or deleting it via kubectl, that node's Pod stays around forever (at least over a week, anyway), without being moved to another node.
With Redis, even though the pod sticks around like with Cassandra, the sentinel service starts a new pod, so the number of functional pods is always maintained.
Is there a way to automatically move the Cassandra pod to another node, if a node goes down? Or do I have to drain or delete the node manually?
Please refer to the documentation here.
Kubernetes (versions 1.5 or newer) will not delete Pods just because a
Node is unreachable. The Pods running on an unreachable Node enter the
‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter
these states when the user attempts graceful deletion of a Pod on an
unreachable Node. The only ways in which a Pod in such a state can be
removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding,
kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
This was a behavioral change introduced in kubernetes 1.5, which allows StatefulSet to prioritize safety.
There is no way to differentiate between the following cases:
The instance being shut down without the Node object being deleted.
A network partition is introduced between the Node in question and the kubernetes-master.
Both these cases are seen as the kubelet on a Node being unresponsive by the Kubernetes master. If in the second case, we were to quickly create a replacement pod on a different Node, we may violate the at-most-one semantics guaranteed by StatefulSet, and have multiple pods with the same identity running on different nodes. At worst, this could even lead to split brain and data loss when running Stateful applications.
On most cloud providers, when an instance is deleted, Kubernetes can figure out that the Node is also deleted, and hence let the StatefulSet pod be recreated elsewhere.
However, if you're running on-prem, this may not happen. It is recommended that you delete the Node object from kubernetes as you power it down, or have a reconciliation loop keeping the Kubernetes idea of Nodes in sync with the the actual nodes available.
Some more context is in the github issue.

Sudden surge in number of YARN apps on HDInsight cluster

For some reason sometimes the cluster seems to misbehave for I suddenly see surge in number of YARN jobs.We are using HDInsight Linux based Hadoop cluster. We run Azure Data Factory jobs to basically execute some hive script pointing to this cluster. Generally average number of YARN apps at any given time are like 50 running and 40-50 pending. None uses this cluster for ad-hoc query execution. But once in few days we notice something weird. Suddenly number of Yarn apps start increasing, both running as well as pending, but especially pending apps. So this number goes more than 100 for running Yarn apps and as for pending it is more than 400 or sometimes even 500+. We have a script that kills all Yarn apps one by one but it takes long time, and that too is not really a solution. From our experience we found that the only solution, when it happens, is to delete and recreate the cluster. It may be possible that for some time cluster's response time is delayed (Hive component especially) but in that case even if ADF keeps retrying several times if a slice is failing, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run when it can? That's probably the only explanation why it could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job which then submits a child job which is the actual Hive script you are trying to run. The yarn queue can get deadlocked if there are too many parents jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, thus no work is actually being done. Note that if you kill the Templeton job this will result in Data Factory marking the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to be able to allow some of the pending child jobs to run and slowly draining the queue.
As a correct long term fix, you need to configure WebHCat so that parent templeton job is submitted to a separate Yarn queue. You can do this by (1) creating a separate yarn queue and (2) set templeton.hadoop.queue.name to the newly created queue.
To create queue you can do this via the Ambari > Yarn Queue Manager.
To update WebHCat config via Ambari go to Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure

Resources