I accidentally removed a job submission script for a Slurm job in the terminal using the rm command. As far as I know there are no (relatively easy) ways of recovering that file anymore, and I hadn't saved it anywhere else. I have used that job submission script many times before, so there are a lot of Slurm job submissions (all of them finished) that used it. Is it possible to recover the job script from an old finished job somehow?
If Slurm is configured with the ElasticSearch plugin, then you will find the submission script for all completed jobs in the ElasticSearch instance used in the setup.
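If that is your setup, you can query the index directly. A minimal sketch, assuming a local Elasticsearch on the default port, an index named slurm (taken from a typical JobCompLoc setting), and a jobid field; adjust all three to match your configuration:

# jobid:1234567 is a placeholder; field names depend on the plugin version
curl 'http://localhost:9200/slurm/_search?q=jobid:1234567&pretty'

The job-completion document returned for a finished job should contain the submitted batch script among its fields.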
Another option is to install sarchive, which archives job submission scripts as they are submitted, so future scripts are not lost.
Related
Killing Spark job using command Prompt
This is the thread that I hoped would answer my question. But all four answers explain how to kill the entire application.
How can I stop a single job, such as a count, for example?
I can do it from the Spark Web UI by clicking "kill" on the respective job. I suppose it must be possible to list running jobs and interact with them directly via the CLI as well.
Practically speaking, I am working in a notebook with PySpark on a Glue endpoint. If I kill the application, the entire endpoint dies and I have to spin up a new cluster. I just want to stop a job. Cancelling it within the notebook only detaches the notebook from it; the job keeps running and blocks any further commands from being executed.
The Spark History Server provides a REST API interface. Unfortunately, it only exposes monitoring capabilities for applications, jobs, stages, etc.
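That monitoring API is still handy for finding the numeric id of the job you want to cancel, though. Something like this (host/port and application id are placeholders) lists an application's jobs with their status:

# List the jobs of a running application via the monitoring REST API
curl http://<Spark-Application-UI-host:port>/api/v1/applications/<app-id>/jobs

Each entry should include a jobId field plus its status (RUNNING, SUCCEEDED, ...).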
There is also a REST Submission interface that provides capabilities to submit, kill and check the status of applications. It is undocumented AFAIK, and is only supported on Spark standalone and Mesos clusters, not on YARN. (That's why there is no "kill" link in the Jobs UI for Spark on YARN, I guess.)
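For reference, a kill through that submission interface looks roughly like the following; the default REST port on a standalone master is 6066, and the submission id is the driver id returned at submit time (both are assumptions about your setup):

# Kill a driver that was submitted through the standalone REST submission API
curl -X POST http://<spark-master-host>:6066/v1/submissions/kill/<submission-id>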
So you can try using that "hidden" API, but if you know your application's Spark UI URL and the id of the job you want to kill, the easier way is something like:
$ curl -G http://<Spark-Application-UI-host:port>/jobs/job/kill/?id=<job_id>
Since I don't work with Glue, I'd be interested to find out myself how it's going to react, because the kill normally results in org.apache.spark.SparkException: Job <job_id> cancelled.
Building on the answer by mazaneicha, it appears that, for Spark 2.4.6 in standalone mode with jobs submitted in client mode, the curl request to kill an app with a known applicationID is:
curl -d "id=<your_appID>&terminate=true" -X POST <your_spark_master_url>/app/kill/
We had a similar problem with people not disconnecting their notebooks from the cluster and hence hogging resources.
We get the list of running applications by parsing the web UI. I'm pretty sure there are less painful ways to manage a Spark cluster.
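One less painful option, at least on a standalone master, might be its JSON endpoint rather than scraping HTML (host and port below are placeholders):

# Ask the standalone master for its state as JSON
curl http://<spark-master-host>:8080/json/

Running applications and their ids should show up under activeapps in the response.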
List the job in Linux and kill it.
I would do
ps -ef | grep spark-submit
if it was started using spark-submit. Get the PID from the output and then
kill -9 <PID>
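If you prefer a one-liner, something along these lines should also work; the application name is a placeholder, and it's worth checking what pgrep matches before sending the signal:

# Find the spark-submit process for a specific application and kill it
kill -9 $(pgrep -f 'spark-submit.*<your-app-name>')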
Kill the running job by:
Open the Spark application UI.
Go to the Jobs tab.
Find the job among the running jobs.
Click the kill link and confirm.
We are using DataStax Spark 6.0.
We are submitting jobs using crontab to run every 5 minutes. We wrote a script to check whether the application is already running, to avoid submitting it twice. Is there a way, at the Spark level, to stop the job submission or keep the job in a queue, to avoid duplicate jobs for the same application?
Thanks,
Rakesh
I tried using crontab only.
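For what it's worth, a minimal sketch of such a duplicate-submission guard directly in the crontab entry, using flock; the lock file path, class, and jar are assumptions, not DataStax specifics:

# Skip this run if the previous submission still holds the lock
*/5 * * * * flock -n /tmp/myapp.lock -c 'dse spark-submit --class com.example.MyApp /path/to/myapp.jar'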
You can use Oozie to schedule your Spark job.
I'm running a PBS job (Python) on the cluster using the qsub command. I'm curious to know how I can restart the same job from the step where it failed.
Any help will be highly appreciated.
Most likely, you cannot.
Restarting a job requires a checkpoint file.
For this, checkpointing support has to be explicitly configured in your HPC environment, and the job then has to be submitted with additional command-line arguments.
See http://docs.adaptivecomputing.com/torque/3-0-5/2.6jobcheckpoint.php
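For illustration, with checkpointing support (e.g. BLCR) set up by your admins, the submission might look roughly like this; the interval and script name are placeholders:

# Ask Torque to checkpoint the job periodically (here: every 30 minutes)
qsub -c enabled,periodic,interval=30 my_job.pbs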
I use PBS job arrays to submit a number of jobs. Sometimes a small number of jobs get screwed up and do not run successfully. Is there a way to automatically detect the failed jobs and restart them?
pbs_server supports automatic_requeue_exit_code:
an exit code, defined by the admin, that tells pbs_server to requeue the job instead of considering it as completed. This allows the user to add some additional checks that the job can run meaningfully, and if not, then the job script exits with the specified code to be requeued.
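As a rough sketch (the exit code 99 and the test are arbitrary placeholders), the admin sets the code once with qmgr, and the job script exits with it whenever its own sanity check fails:

# Admin side: choose the exit code that triggers a requeue instead of completion
qmgr -c 'set server automatic_requeue_exit_code = 99'

# Job script side: bail out with that code if the job cannot run meaningfully yet
[ -s "$INPUT_FILE" ] || exit 99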
There is also a provision for requeuing jobs in the case where the prologue fails (see the prologue/epilogue script documentation).
There are probably more sophisticated ways of doing this, but they would fall outside the realm of built-in Torque options.
I am using Torque 4.2.5 to schedule my jobs, and I need to find a way to make a copy of the PBS launch script that I used for jobs that are currently queued or running. The plan is to place a copy of that launch script in the output folder.
TORQUE has a job logging feature that can be configured to record job scripts used at launch time.
EDIT: if you have administrator privileges and want to read the stored file, you can inspect TORQUE_HOME/server_priv/<jobid>.SC
TORQUE_HOME is usually /var/spool/torque but is configurable.
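If I remember correctly, recording the script is controlled by two server attributes; enabling them might look like this (attribute names can differ between TORQUE versions, so treat it as a sketch and check qmgr's output):

# Turn on job logging and include each job's script in its log record
qmgr -c 'set server record_job_info = True'
qmgr -c 'set server record_job_script = True'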