How can I restart a failed PBS job on a cluster (qsub)? - python-3.x

I'm running a PBS job (Python) on the cluster using the qsub command. I'm curious to know how I can restart the same job from the step where it failed.
Any type of help will be highly appreciated.

Most likely, you cannot.
Restarting a job requires a checkpoint file.
For this, checkpointing support has to be explicitly configured in your HPC environment, and then the job has to be submitted with additional command-line arguments.
See http://docs.adaptivecomputing.com/torque/3-0-5/2.6jobcheckpoint.php
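As a rough sketch only, assuming a TORQUE setup where the administrators have enabled BLCR-based checkpointing as described in the linked docs (the script name, job id, and interval below are placeholders, not from your environment), submission and restart could look something like this:

# submit with periodic checkpointing every 30 minutes (requires checkpoint support on the nodes)
qsub -c enabled,periodic,interval=30 my_job.sh

# checkpoint and hold a running job, then release it so it restarts from the checkpoint
qhold <job_id>
qrls <job_id>

Without that kind of setup, a job submitted with a plain qsub has no state to resume from, and you would have to build restart logic into the Python script itself (e.g. writing its own progress files and skipping completed steps on rerun).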

Related

Running a Spark application as a scheduled job

We have written a program to fetch data from different sources, make modifications, and write the modified data into a MySQL database. The program uses Apache Spark for the ETL process and makes use of the Spark Java API for this. We will be deploying the live application on YARN or Kubernetes.
I need to run the program as a scheduled job, say with an interval of five minutes. I did some research and got different suggestions from blogs and articles, such as a plain cron job, AWS Glue, Apache Airflow, etc., for scheduling a Spark application. From my reading, it seems I can't run my code (Spark Java API) using AWS Glue, as it supports only Python and Scala.
Can someone provide insights or suggestions on this? Which is the best option for running a Spark application (on Kubernetes or YARN) as a scheduled job?
Is there an option for this in Amazon EMR? Thanks in advance.
The best option, I think, and the one I have used before, is a cron job, either:
From inside your container with crontab -e, with logging set up in case of failure, such as:
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/spark/bin
40 13 * * * . /PERSIST_DATA/02_SparkPi_Test_Spark_Submit.sh > /PERSIST_DATA/work-dir/cron_logging/cronSpark 2>&1
Or with a Kubernetes CronJob; see here for the different settings:
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
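As a minimal sketch of the CronJob route (the image name, class, and jar path are placeholders for whatever your build produces), you could create the schedule imperatively with kubectl and let it run spark-submit every five minutes:

# hypothetical example: run a spark-submit wrapper every 5 minutes as a Kubernetes CronJob
kubectl create cronjob spark-etl \
  --image=my-registry/spark-etl:latest \
  --schedule="*/5 * * * *" \
  -- /opt/spark/bin/spark-submit --class com.example.EtlJob /opt/app/etl.jar

The CronJob approach has the advantage that Kubernetes keeps a history of failed and successful runs, whereas with crontab inside a container you only have your own log files.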

Kill Spark Job or terminate EMR Cluster if job takes longer than expected

I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job -- ideally there would be some kind of Spark/YARN/EMR setting I could use.
Note: I've tried using Spark speculation to unblock the stuck Spark job, but that doesn't help.
EMR has a Bootstrap Actions feature that lets you run scripts when the cluster is initialized. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates it after a certain time.
I use a script based on this one for the bootstrap action: https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, make a script that checks /proc/uptime to see how long the EC2 machine has been online, and once uptime surpasses your time limit, send a shutdown command to the cluster.
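A rough sketch of that idea (this is not the linked script, just the uptime check described above; the 5-hour limit and sleep interval are assumptions you would tune):

#!/bin/bash
# Hypothetical bootstrap-action sketch: self-terminate after a fixed uptime.
MAX_SECONDS=$((5 * 60 * 60))   # 5-hour limit, matching the expected job duration

while true; do
    UPTIME_SECONDS=$(cut -d. -f1 /proc/uptime)
    if [ "$UPTIME_SECONDS" -gt "$MAX_SECONDS" ]; then
        # Shutting down the master node ends this ephemeral, single-step cluster
        sudo shutdown -h now
    fi
    sleep 300
done &

The loop is backgrounded so the bootstrap action itself returns and the cluster can finish provisioning.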

Is there a shell command for Spark that says what jobs are queued or running?

Environment: Spark 1.6.2; Linux 2.6.x (Red Hat 4.4.x); Hadoop 2.4.x.
I launched a job this morning through spark-submit but do not see the files it was supposed to write. I've read a bit about the web UI for monitoring Spark jobs, but at this point my only visibility into what is happening on the Hadoop cluster and HDFS is through a bash shell terminal.
Question: what are the standard ways, from the command line, to get a quick readout on Spark jobs and any log trail they might leave behind (during or after job execution)?
Thanks.
You can use yarn application -list.
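For example (the application id below is a placeholder for whatever the list command shows for your job):

# list running applications, or all applications including finished ones
yarn application -list
yarn application -list -appStates ALL

# check the status of a specific application
yarn application -status application_1473860000000_0001

# pull aggregated logs for a finished application (if log aggregation is enabled)
yarn logs -applicationId application_1473860000000_0001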

Apache NiFi - Submitting Spark batch jobs through Apache Livy

I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in the property or taken from the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi, and how can I take different actions if the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can use the GenerateFlowFile processor as the scheduled trigger (on the primary node, so it won't be scheduled twice - unless you want that), connect it to the ExecuteProcess processor, and have it run the spark-submit command.
For a slightly more complex workflow, I've written an article about it. :)
Hope it helps.
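As a sketch of what ExecuteProcess would run (the master, class name, jar path, and arguments are placeholders, not taken from your setup), the command is just an ordinary spark-submit:

# hypothetical command for ExecuteProcess
# (Command: spark-submit, Command Arguments: everything after it)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.BatchJob \
  /opt/jobs/batch-job.jar arg1 arg2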

Spark job scheduler without YARN/Mesos

I want to schedule some Spark jobs at specified time intervals. Every scheduler that I found works only with YARN/Mesos (e.g. Oozie, Luigi, Azkaban, Airflow). I'm running DataStax, and it doesn't have the option of running with YARN or Mesos. I saw somewhere that Oozie might work with DataStax but couldn't find any help with that. Is there any solution to this problem, or is the only option to write a scheduler myself?