Slurm: Show node state of POWERED_DOWN nodes in a cloud scheduling system - slurm

Problem Description
Let's say I have two CLOUD nodes running. Now for some reason they fail and go to DOWN and they POWER_DOWN. Now sinfo will no longer detect them. You can still change the node state using sudo scontrol update NodeName=$NODE_NAME state=$BECOME_STATE reason=$REASON and there are clear signs that a node state is still held. For example: Slurm will not schedule jobs to DOWN nodes even if they are POWER_DOWN, so the node being DOWN must be held by slurm somewhere.
Using scontrol show $NODE_NAME also doesn't work when the node is in POWER_DOWN.
Question
Is there any possibility to retrieve the node state of a node which is not POWER_UP i.e. not shown by sinfo due to not being "online"?
Additional Information
In our Cloud Scheduling instances are created once they are needed and deleted once they are no longer deleted. This means that slurm stores that a node is DOWN while no real instance behind it exists anymore. If that node wouldn't be marked DOWN and a job would be scheduled towards it at a later time, it would simply start an instance and run on that new instance. I am just stating this to be maximum explicit.

This can be solved by adding
PrivateData=cloud
to your slurm configuration. This is only mentioned here: https://slurm.schedmd.com/elastic_computing.html
Powered down nodes will then be displayed with a ~ postfix.

Related

In a clustered environment, if there is items.xml change, is it enough to do Update Running System in one node and just clear cache the other nodes?

In a clustered environment, if there is items.xml change, is it enough to do Update Running System in one node and just clear cache the other nodes?
​Assume that I have Node1 and Node2. If I add an attribute in items.xml and build-and-deploy it to Node1 and Node2, is it enough for me to do Update Running System in Node1, and just Clear Cache Node2? Or, do I also need to do Update Running System in Node2?
Ideally, its good to restart other nodes once you are done with an update running system. If zero downtime is not the case, then take downtime to make on node ready with an update system and then make other nodes ready.
Another way:
Take node1 out of the cluster, get it ready with all configuration, update system, etc. Add it back to the cluster and take all other nodes down. Do deployment on other remaining nodes and make them up. Here you do not need to update the running system or cache clear.
if downtime is really critical for the business, then refer rolling update on the cluster to implement the proper solution.

How to know if a machine in a Spark cluster 'participate's a job

I wanted to know when it is safe to remove a node from a machine from a cluster.
My assumption is that it could be safe to remove a machine if the machine does not have any containers, and it does not store any useful data.
By the APIs at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can do
GET http://<rm http address:port>/ws/v1/cluster/nodes
to get the information of each node like
<node>
<rack>/default-rack</rack>
<state>RUNNING</state>
<id>host1.domain.com:54158</id>
<nodeHostName>host1.domain.com</nodeHostName>
<nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
<lastHealthUpdate>1476995346399</lastHealthUpdate>
<version>3.0.0-SNAPSHOT</version>
<healthReport></healthReport>
<numContainers>0</numContainers>
<usedMemoryMB>0</usedMemoryMB>
<availMemoryMB>8192</availMemoryMB>
<usedVirtualCores>0</usedVirtualCores>
<availableVirtualCores>8</availableVirtualCores>
<resourceUtilization>
<nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
<nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
<nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
<aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
<aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
<containersCPUUsage>0.0</containersCPUUsage>
</resourceUtilization>
</node>
If numContainers is 0, I assume it does not run containers. However can it still store any data on disk that other downstream tasks can read?
I did not get if Spark lets us know this. I assume if a machine still stores some data useful for the running job, the machine may maintain a heart beat with Spark Driver or some central controller? Can we check this by scanning tcp or udp connections?
Is there any other way to check if a machine in a Spark cluster participates a job?
I am not sure whether you just want to know if a node is running any task (is that's what you mean by 'participate') or you want to know if it is safe to remove a node from the Spark cluster
I will try to explain the latter point.
Spark has the ability to recover from the failure, which also applies to any node being removed from the cluster.
The node removed can be an executor or an application master.
If an application master is removed, the entire job fails. But is you are using yarn as a resource manager, the job is retried and yarn gives a new application master. The number if retries is configured in :
yarn.resourcemanager.am.max-attempts
By default, this value is 2
If a node on which a task is running is removed, the resource manager (which is handled by yarn) will stop getting heartbeats from that node. Application master will know it is supposed to reschedule the failed job as it will no longer receive progress status from the previous node. It will then request resource manager for resources and then reschedule the job.
As far as data on these nodes is concerned, you need to understand how the tasks and their output are handled. Every node has its own local storage to store the output of the tasks running on them. After the tasks are run successfully, the OutputCommitter will move the output from local storage to the shared storage (HDFS) of the job from where the data is picked for the next step of the job.
When a task fails (may be because the node that runs this job failed or was removed), the task is rerun on another available node.
In fact, the application master will also rerun the successfully run tasks on this node as their output stored on the node's local storage will not longer be available.

StatefulSet: pods stuck in unknown state

I'm experimenting with Cassandra and Redis on Kubernetes, using the examples for v1.5.1.
With a Cassandra StatefulSet, if I shutdown a node without draining or deleting it via kubectl, that node's Pod stays around forever (at least over a week, anyway), without being moved to another node.
With Redis, even though the pod sticks around like with Cassandra, the sentinel service starts a new pod, so the number of functional pods is always maintained.
Is there a way to automatically move the Cassandra pod to another node, if a node goes down? Or do I have to drain or delete the node manually?
Please refer to the documentation here.
Kubernetes (versions 1.5 or newer) will not delete Pods just because a
Node is unreachable. The Pods running on an unreachable Node enter the
‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter
these states when the user attempts graceful deletion of a Pod on an
unreachable Node. The only ways in which a Pod in such a state can be
removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding,
kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
This was a behavioral change introduced in kubernetes 1.5, which allows StatefulSet to prioritize safety.
There is no way to differentiate between the following cases:
The instance being shut down without the Node object being deleted.
A network partition is introduced between the Node in question and the kubernetes-master.
Both these cases are seen as the kubelet on a Node being unresponsive by the Kubernetes master. If in the second case, we were to quickly create a replacement pod on a different Node, we may violate the at-most-one semantics guaranteed by StatefulSet, and have multiple pods with the same identity running on different nodes. At worst, this could even lead to split brain and data loss when running Stateful applications.
On most cloud providers, when an instance is deleted, Kubernetes can figure out that the Node is also deleted, and hence let the StatefulSet pod be recreated elsewhere.
However, if you're running on-prem, this may not happen. It is recommended that you delete the Node object from kubernetes as you power it down, or have a reconciliation loop keeping the Kubernetes idea of Nodes in sync with the the actual nodes available.
Some more context is in the github issue.

Sudden surge in number of YARN apps on HDInsight cluster

For some reason sometimes the cluster seems to misbehave for I suddenly see surge in number of YARN jobs.We are using HDInsight Linux based Hadoop cluster. We run Azure Data Factory jobs to basically execute some hive script pointing to this cluster. Generally average number of YARN apps at any given time are like 50 running and 40-50 pending. None uses this cluster for ad-hoc query execution. But once in few days we notice something weird. Suddenly number of Yarn apps start increasing, both running as well as pending, but especially pending apps. So this number goes more than 100 for running Yarn apps and as for pending it is more than 400 or sometimes even 500+. We have a script that kills all Yarn apps one by one but it takes long time, and that too is not really a solution. From our experience we found that the only solution, when it happens, is to delete and recreate the cluster. It may be possible that for some time cluster's response time is delayed (Hive component especially) but in that case even if ADF keeps retrying several times if a slice is failing, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run when it can? That's probably the only explanation why it could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job which then submits a child job which is the actual Hive script you are trying to run. The yarn queue can get deadlocked if there are too many parents jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, thus no work is actually being done. Note that if you kill the Templeton job this will result in Data Factory marking the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to be able to allow some of the pending child jobs to run and slowly draining the queue.
As a correct long term fix, you need to configure WebHCat so that parent templeton job is submitted to a separate Yarn queue. You can do this by (1) creating a separate yarn queue and (2) set templeton.hadoop.queue.name to the newly created queue.
To create queue you can do this via the Ambari > Yarn Queue Manager.
To update WebHCat config via Ambari go to Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure

Adding a new node to existing cluster

Is it possible to add a new node to an existing cluster in cassandra 1.2 without running nodetool cleanup on each individual node once data has been added?
It probably isn't but I need to ask because I'm trying to create an application where each user's machine is a server allowing for endless scaling.
Any advice would be appreciated.
Yes, it is possible. But you should be aware of the side-effects of not doing so.
nodetool cleanup purges keys that are no longer allocated to that node. According to the Apache docs, these keys count against the allocated data for that node, which can cause the auto bootstrap process for the next node to not properly balance the ring. So depending on how you are bringing new user machines into the ring, this may or may not be a problem.
Also keep in mind that nodetool cleanup only needs to be run on nodes that lost keyspace to the new node - i.e. adjacent nodes, not all nodes, in the cluster.

Resources