Job dependencies across several clusters - Slurm

Let's say I submit a job to several cluster queues with --clusters=a,b,c and sbatch says that it was submitted with the id ID on cluster a.
Can I submit another job with the first one as a dependency, but on a different cluster? Something like --dependency=afterok:ID and --clusters=b,c. To me this does not seem possible, as ID is only relevant within the queue of cluster a, but I want to be sure.

If the clusters are organised as a federation, job IDs are globally unique, so this will work. If they are not part of a federation, job IDs are relevant only within the scope of each cluster, and your intuition is correct. See the Slurm federation documentation for further information.
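As a minimal sketch of the federated case (assuming the clusters do form a federation; the script names job.sh and post.sh, and the use of Python with --parsable to capture the ID, are just illustrative choices):

    import subprocess

    # --parsable makes sbatch print "jobid" (or "jobid;cluster") instead of a sentence
    out = subprocess.run(
        ["sbatch", "--parsable", "--clusters=a,b,c", "job.sh"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    job_id = out.split(";")[0]  # keep only the numeric ID

    # In a federation the ID is unique across all member clusters,
    # so the dependency can be resolved on clusters b and c as well
    subprocess.run(
        ["sbatch", f"--dependency=afterok:{job_id}", "--clusters=b,c", "post.sh"],
        check=True,
    )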

Related

What will happen if my driver or executor is lost in Spark while running an spark-application?

Three related questions:
What will happen if one of my executors is lost?
What will happen if my driver is lost?
What will happen in case of a stage failure?
In all the above cases, are they recoverable? If yes, how do I recover? Is there any option in SparkConf that can be set to prevent these failures?
Thanks.
Spark uses its own job scheduling: the DAGScheduler and TaskScheduler inside the driver re-schedule failed tasks and stages, while the cluster manager (Standalone, YARN, Mesos) can restart a failed application or driver.
For example, if you use YARN, try tweaking spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts. You can also track jobs manually via the YARN REST API: https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html
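A minimal sketch of setting the Spark-side retry knobs (the values are arbitrary examples, and spark.task.maxFailures is an additional related setting not mentioned above; yarn.resourcemanager.am.max-attempts is a cluster-side YARN setting that must allow at least as many attempts):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("retry-example")
        .config("spark.yarn.maxAppAttempts", "4")   # retries of the driver / Application Master
        .config("spark.task.maxFailures", "8")      # per-task retries before the whole job fails
        .getOrCreate()
    )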
If you want to recover from logical errors, you can try checkpointing (saving records to HDFS for later use): https://mallikarjuna_g.gitbooks.io/spark/content/spark-streaming/spark-streaming-checkpointing.html. (For really long and important pipelines I recommend saving your data in normal files instead of checkpoints!).
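A minimal PySpark Streaming sketch of checkpointing (the checkpoint directory, socket source and batch interval are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"  # placeholder path

    def create_context():
        sc = SparkContext(appName="checkpoint-example")
        ssc = StreamingContext(sc, 10)                   # 10-second batches
        ssc.checkpoint(CHECKPOINT_DIR)                   # state + metadata go to HDFS
        lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
        lines.count().pprint()
        return ssc

    # On a restart, the context (and pending state) is rebuilt from the checkpoint
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()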
Configuring highly available clusters is a more complex task than tweaking a single setting in SparkConf. You can try to implement different scenarios and come back with more detailed questions. As a first step, try running everything on YARN.

Is it possible to know the resources used by a specific Spark job?

I'm exploring the idea of using a multi-tenant Spark cluster, where the cluster executes jobs on demand for a specific tenant.
Is it possible to "know" the specific resources used by a specific job (for billing purposes)? E.g. if a job causes several Kubernetes nodes to be allocated automatically, is it possible to track which Spark jobs (and, ultimately, which tenant) initiated these resource allocations? Or are jobs always spread evenly across the allocated resources?
I tried to find information on the Apache Spark site and elsewhere on the internet without success.
See https://spark.apache.org/docs/latest/monitoring.html
You can export data from the Spark History Server as JSON and then write your own resource-accounting logic on top of it.
Note that what you are describing is a Spark application, not a single Spark job.
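As a rough sketch of the kind of resource accounting you can build on the History Server REST API described in the monitoring docs (the History Server URL is a placeholder, and mapping an application name back to a tenant is an assumption about your own naming scheme):

    import requests

    HISTORY_SERVER = "http://history-server:18080/api/v1"  # placeholder URL

    apps = requests.get(f"{HISTORY_SERVER}/applications").json()
    for app in apps:
        executors = requests.get(
            f"{HISTORY_SERVER}/applications/{app['id']}/executors"
        ).json()
        total_cores = sum(e.get("totalCores", 0) for e in executors)
        # e.g. map app["name"] back to a tenant and feed this into your billing logic
        print(app["id"], app["name"], "cores:", total_cores)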

How Spark Scheduler works in K8s environment?

Here is the scenario:
I am running a 10-node Spark cluster in a K8s environment (EKS).
I want customer A to use the first 5 nodes (node1,2,3,4,5) and customer B to use the next 5 nodes, all the time.
I don't think K8s affinity can help me here because the Spark scheduler has a mind of its own.
A Spark node != a Kubernetes node, so the (anti-)affinity that gives the k8s scheduler hints about how to schedule particular pods is out of the question.
Can't you just deploy two standalone Spark clusters and give customer A access to the first cluster (say, 1 master and 5 workers), and do the same for customer B?
If this scenario is something you would like to try, check also my https://github.com/radanalyticsio/spark-operator
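If you go the two-clusters route, isolation then falls out naturally because each customer's applications only ever talk to their own master, e.g. (host names are placeholders):

    from pyspark.sql import SparkSession

    # Customer A's jobs point at the standalone cluster running on "their" 5 nodes
    spark_a = (
        SparkSession.builder
        .appName("customer-a-job")
        .master("spark://customer-a-master:7077")  # placeholder master URL
        .getOrCreate()
    )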

Sudden surge in number of YARN apps on HDInsight cluster

For some reason the cluster sometimes seems to misbehave, and I suddenly see a surge in the number of YARN jobs. We are using an HDInsight Linux-based Hadoop cluster and run Azure Data Factory (ADF) jobs that basically execute Hive scripts against this cluster. Generally the average number of YARN apps at any given time is around 50 running and 40-50 pending, and nobody uses this cluster for ad-hoc query execution.

But once every few days we notice something weird. The number of YARN apps suddenly starts increasing, both running and pending, but especially pending. The number of running YARN apps goes above 100, and pending apps exceed 400 or sometimes even 500+. We have a script that kills all YARN apps one by one, but it takes a long time and is not really a solution. From our experience, the only fix we have found when this happens is to delete and recreate the cluster.

It may be that the cluster's response time is delayed for a while (the Hive component especially), but in that case, even if ADF keeps retrying a failing slice several times, is it possible that the cluster stores all the supposedly failed slice execution requests (according to ADF) in a pool and tries to run them when it can? That's probably the only explanation for why this could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data Factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job, which then submits a child job that is the actual Hive script you are trying to run. The YARN queue can get deadlocked if too many parent jobs at one time fill up the cluster capacity, so that no child job (the actual work) is able to spin up an Application Master, and thus no work is actually being done. Note that if you kill the Templeton job, Data Factory will mark the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource setting from the default 33% to something higher and/or scaling up your cluster. The goal is to allow some of the pending child jobs to run and slowly drain the queue.
As a proper long-term fix, you need to configure WebHCat so that the parent Templeton job is submitted to a separate YARN queue. You can do this by (1) creating a separate YARN queue and (2) setting templeton.hadoop.queue.name to the newly created queue.
You can create the queue via Ambari > YARN Queue Manager.
To update the WebHCat config via Ambari, go to the Hive tab > Advanced > Advanced webhcat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure
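As a rough sketch of the settings involved (the queue name "templeton" and the capacity split are placeholders; adjust them to your cluster):

    # capacity-scheduler (edited via Ambari > YARN Queue Manager)
    yarn.scheduler.capacity.root.queues=default,templeton
    yarn.scheduler.capacity.root.default.capacity=90
    yarn.scheduler.capacity.root.templeton.capacity=10
    yarn.scheduler.capacity.maximum-am-resource-percent=0.33

    # webhcat-site (Ambari > Hive > Advanced > Advanced webhcat-site)
    templeton.hadoop.queue.name=templeton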

How can Cassandra have no single point of failure when it has no master and data is not replicated but distributed?

Perhaps I am misunderstanding, but the Apache Cassandra Wikipedia article says:
"Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master as every node can service any request."
How can each node contain different data, but there be no single point of failure? For instance, I would imagine that in this scenario, if a node containing the record I was querying went down, then a different node would pick up that request; however, it would not have the data to satisfy it, since that data was on the node that went down.
Can someone clear this up for me?
Thanks!
Cassandra clusters do replicate data across the nodes. The specific number of replicas is configurable, but generally production clusters will use a replication factor of 3. This means that a given row will be stored on three different machines in the cluster. See the reference documentation on replication for more details.
In terms of servicing requests, if a node receives a request for data that it does not have it will forward that request to the nodes that do own the data.
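A minimal sketch with the DataStax Python driver, showing that the replication factor is a property of the keyspace (the contact point and keyspace name are placeholders):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # placeholder contact point
    session = cluster.connect()

    # Every row in this keyspace is stored on 3 different nodes
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    # Any node can coordinate a request; it forwards reads/writes to the replicas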
