TORQUE qsub: job gets rejected if there are not enough nodes - pbs

When I submit a job using qsub, the job gets rejected if there are not enough nodes available. Is there any configuration that tells TORQUE to queue the job (without running it) instead of rejecting it?
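For reference, one possibility (not stated in the question) is that the request exceeds what pbs_server believes the cluster can ever provide, in which case TORQUE rejects at submit time rather than queueing. A minimal sketch of where to look, assuming a queue named batch and nodect as the limiting resource (both are assumptions here):

# Check how many nodes pbs_server currently considers usable
pbsnodes -a | grep -c "state = free"

# Inspect the server's and queue's resource limits
qmgr -c "print server"

# Assumption: if resources_available.nodect is lower than what jobs request,
# raising it may let oversized jobs wait in the queue instead of being rejected
qmgr -c "set server resources_available.nodect = 16"
qmgr -c "set queue batch resources_available.nodect = 16"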

Related

Callback on host node after slurm job has been allocated

I'd like to do two things in sequence:
1. Submit a job with sbatch.
2. Once the job has been allocated, retrieve the hostname of the allocated node and, using that name, execute a second command on the host (login) node.
Step 2 is the hard part. I suppose I could write a Python script that polls squeue. Is there a better way? Can I set up a callback that Slurm will execute once a job starts running?
(In case you're curious, my motivation is to launch a Jupyter notebook on a compute node and automatically set up ssh port forwarding as in this blog post.)
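Not from the original question, but here is a sketch of the polling approach using only standard Slurm commands; the script name, port, and single-node assumption are placeholders:

#!/bin/bash
# Submit the notebook job and capture its job id
JOBID=$(sbatch --parsable notebook.sbatch)

# Poll squeue until the job is actually running
while [ "$(squeue -h -j "$JOBID" -o %T)" != "RUNNING" ]; do
    sleep 5
done

# Ask Slurm which node the job landed on (assumes a single-node job)
NODE=$(squeue -h -j "$JOBID" -o %N)

# On the login node: forward local port 8888 to Jupyter on the compute node
ssh -N -L 8888:localhost:8888 "$NODE" &

The same two squeue calls could equally be wrapped in the Python polling script mentioned above.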

DEADLINE_EXCEEDED when Airflow runs spark jobs

We are running an Airflow pipeline that executes several Spark jobs on Dataproc. One of these jobs takes 3-4 hours to complete. We are seeing the following error messages in the Airflow logs, even though the Spark job succeeds:
HttpError 504 when requesting https://dataproc.googleapis.com/v1/projects/our-project/regions/global/jobs/our_job_20170930_5c0ee1ff?alt=json returned "Deadline expired before operation could complete
This causes Airflow to retry the task (which actually succeeded).
I see in the documentation that the DEADLINE_EXCEEDED error "may be returned even if the operation has completed successfully. For example, a successful response from a server could have been delayed long enough for the deadline to expire".
So my question is: is there any configuration parameter we can tweak to avoid these timeouts and retries?
EDIT:
In the job output in Dataproc we see a log message, just before it finishes:
17/10/06 08:33:51 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(10,WrappedArray())
This further gives us the impression that the Spark job ran to completion, and that when it tried to report back that it was done, nobody was listening.
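Not an answer to the configuration question, but while experimenting the job's real terminal state can be double-checked directly against Dataproc, bypassing Airflow; the job id and region below are copied from the error URL above:

# Ask Dataproc directly for the job's final state
gcloud dataproc jobs describe our_job_20170930_5c0ee1ff \
    --region=global \
    --format="value(status.state)"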

Spark standalone master HA jobs in WAITING status

We are trying to set up HA on the Spark standalone master using ZooKeeper.
We have two ZooKeeper hosts, which we are also using for Spark HA.
We configured the following in spark-env.sh:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk_server1:2181,zk_server2:2181"
We started both masters.
We then started a spark-shell; the status of the application is RUNNING.
master1 is ALIVE and master2 is in STANDBY status.
We killed master1; master2 took over and all the workers appeared alive in master2.
The shell that was already running was moved to the new master. However, the application stays in WAITING status and the executors are in LOADING status.
There are no errors in the worker or executor logs, apart from a notification that they connected to the new master.
I can see that the worker re-registered, but the executor does not seem to start. Is there anything I am missing?
My Spark version is 1.5.0.
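Not from the original post, but for context, the usual way to point workers and applications at both masters in a standalone HA setup looks like the sketch below; the host names are placeholders matching the spark-env.sh line above:

# Workers register with both masters so they can fail over
./sbin/start-slave.sh spark://master1:7077,master2:7077

# Applications (spark-shell here) also list both masters
./bin/spark-shell --master spark://master1:7077,master2:7077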

HDFS bytes read and written by a Job launched through YARN

I was wondering if there is any way of checking the number of HDFS bytes read or written by a Spark application launched through YARN. For example, if I check the jobs completed in YARN:
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):2
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1451137492021_0002 com.abrandon.upm.GenerateRandomText SPARK alvarobrandon default FINISHED SUCCEEDED 100% N/A
The idea is to be able to monitor the number of bytes that application_1451137492021_0002 has read or written. I have checked the datanode logs, but I can only find traces of non-MapReduce jobs and, of course, no trace of this particular application.
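One option (not mentioned in the question) is to read the per-stage I/O counters that Spark publishes through its monitoring REST API, for example via the History Server. A sketch, where the host and port are placeholders and the application id is the one listed by YARN above:

# Query the Spark History Server REST API for the finished application
APPID=application_1451137492021_0002
curl -s "http://historyserver:18080/api/v1/applications/$APPID/stages" \
  | grep -E '"(inputBytes|outputBytes)"'

Each stage entry reports inputBytes and outputBytes; when the input and output paths are on HDFS, these roughly correspond to the HDFS traffic of the application.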

Set Slurm to send an email after all my jobs are done?

Is it possible to do that without writing my own daemon? I know Slurm can send you an email for each job, but I would like a single email when I have no more pending or running jobs.
One option is to submit an empty job whose only purpose is to send the email, and to ask Slurm to run that job last.
You can do that using the --dependency=singleton option. From the documentation:
singleton: This job can begin execution after any previously launched jobs sharing the same job name and user have terminated.
So you need to give all your jobs the same name (--job-name=commonname), and you should request the minimum resources possible for the notification job so that it is not delayed further once all your other jobs have finished.
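Putting that together, a minimal sketch; the job name, scripts, and mail address are placeholders:

# All real work shares one job name
sbatch --job-name=commonname work1.sbatch
sbatch --job-name=commonname work2.sbatch

# Tiny notification job: can only start after every "commonname" job
# owned by this user has terminated, then mails on completion
sbatch --job-name=commonname --dependency=singleton \
       --ntasks=1 --time=00:01:00 \
       --mail-type=END --mail-user=me@example.com \
       --wrap="true"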
