Application job submission without duplication - apache-spark

We are using DataStax Spark 6.0.
We are submitting jobs using crontab to run every 5 minutes. We wrote a script that checks whether the application is already running, to avoid submitting the same application twice. Is there a way, at the Spark level, to stop a job submission or keep the job in a queue, so that duplicate jobs for the same application are avoided?
Thanks
Rakesh
So far I have only tried crontab.

You can use Oozie to schedule your Spark job.
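If you would rather stay with crontab, the usual pattern is a lock file around spark-submit, so that a second invocation exits immediately instead of submitting a duplicate. Below is a minimal sketch in Python; the lock path, class name, and jar are placeholders, not anything DataStax-specific.

# submit_guarded.py - hypothetical wrapper around spark-submit
import fcntl
import subprocess
import sys

LOCK_PATH = "/tmp/my_spark_app.lock"  # assumption: any path writable by the cron user

with open(LOCK_PATH, "w") as lock:
    try:
        # Take a non-blocking exclusive lock; fails if a previous run still holds it
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("Previous submission still running - skipping this one")
    # Lock held: safe to submit; run() blocks until spark-submit returns
    subprocess.run(["spark-submit", "--class", "com.example.MyApp", "my-app.jar"], check=True)
# The lock is released when the file is closed or the process exits

Note this only deduplicates if spark-submit blocks until the application finishes; in a cluster deploy mode that returns right after submission, you would have to poll the application state instead.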

Related

Scheduling Spark job via Marathon

I want to schedule a Spark job to run daily via Marathon. I am using Mesos as the cluster manager.
How do I schedule a job to run only once a day via Marathon? Right now the job keeps running again and again as soon as it finishes.
There is no way to schedule a periodic job on Marathon. You need to use another framework, such as one of the following (see the Metronome sketch after this list):
Metronome
Chronos
Singularity
Aurora
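For example, Metronome (the job scheduler shipped with DC/OS) takes cron-style schedules through a REST API. A rough sketch with the Python requests library; the endpoint, job id, command, and resource numbers are all placeholders you would adapt:

import requests

BASE = "http://metronome.example.com:9000"  # assumption: your Metronome endpoint

# Define the job: what to run and with which resources
job = {
    "id": "daily-spark-job",
    "run": {
        "cmd": "spark-submit --master mesos://leader.mesos:5050 my-app.jar",
        "cpus": 1,
        "mem": 1024,
        "disk": 0,
    },
}
requests.post(f"{BASE}/v1/jobs", json=job).raise_for_status()

# Attach a schedule: run once a day at midnight UTC
schedule = {"id": "daily", "cron": "0 0 * * *", "enabled": True}
requests.post(f"{BASE}/v1/jobs/daily-spark-job/schedules", json=schedule).raise_for_status()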

Recover Slurm job submission script from an old job?

I accidentally removed a job submission script for a Slurm job in the terminal using the rm command. As far as I know there is no (relatively easy) way of recovering that file anymore, and I hadn't saved it anywhere else. I have used that job submission script many times before, so there are a lot of finished Slurm jobs that used it. Is it possible to recover the job script from an old finished job somehow?
If Slurm is configured with the ElasticSearch job-completion plugin, you will find the submission scripts of all completed jobs in the ElasticSearch instance used in that setup.
Another option is to install sarchive.
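If the ElasticSearch route applies to you, the completed-job documents can be queried directly. A hedged sketch with Python requests; the index name (slurm) and field names (jobid, script) depend on how the jobcomp/elasticsearch plugin was configured, so treat them as assumptions:

import requests

ES = "http://localhost:9200"  # assumption: your ElasticSearch instance

# Look up the completed-job document for a known job id
query = {"query": {"term": {"jobid": 123456}}}
resp = requests.get(f"{ES}/slurm/_search", json=query)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    # The jobcomp/elasticsearch plugin stores the batch script in the job document
    print(hit["_source"].get("script", "<no script field in this document>"))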

Apache NiFi - Submitting Spark batch jobs through Apache Livy

I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in the processor property or taken from the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi, and how can I take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can schedule a GenerateFlowFile processor (on the primary node only, so it won't be triggered twice, unless you want it to be), connect it to an ExecuteProcess processor, and make that run the spark-submit command.
For a slightly more complex workflow, I've written an article about it :)
Hope it helps.
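If you would rather go through Livy itself than spark-submit, Livy's batch REST API lets you submit the jar and then poll its state, which gives you the success/failure branching you asked about. A minimal sketch; the Livy URL, jar path, and class name are placeholders:

import time
import requests

LIVY = "http://livy-host:8998"  # assumption: your Livy endpoint

# Submit the batch job
payload = {"file": "hdfs:///jobs/my-app.jar", "className": "com.example.MyApp"}
resp = requests.post(f"{LIVY}/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll until the job reaches a terminal state
while True:
    state = requests.get(f"{LIVY}/batches/{batch_id}/state").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(30)

print("Batch finished with state:", state)

Inside NiFi, the same submit-and-poll loop could be built from InvokeHTTP processors with a RouteOnAttribute on the returned state to branch on success or failure.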

SparkSession getOrCreate method stuck while another Spark job is running

I'm trying to run Spark jobs from Oozie. I have two actions in the Oozie workflow and they are supposed to run in parallel. However, when Oozie starts them, one is stuck at the sparksession.getOrCreate() method until the other completes.
self.spark_session = SparkSession.builder.master(master).appName(appName).config(conf=conf).enableHiveSupport().getOrCreate()
If you run on YARN, open the Resource Manager UI and check whether there are enough resources (e.g. vCores/memory) for all the jobs.
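If the first action is grabbing the whole queue (dynamic allocation with no cap will do exactly that), capping its request leaves room for the second. A sketch of explicit caps on the builder from the question; the numbers are placeholders to size against your queue:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Cap the first job so the second one can also get containers
conf = SparkConf() \
    .set("spark.executor.instances", "2") \
    .set("spark.executor.cores", "2") \
    .set("spark.executor.memory", "2g") \
    .set("spark.dynamicAllocation.enabled", "false")

spark = SparkSession.builder \
    .master("yarn") \
    .appName("action-one") \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()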

Spark job scheduler without YARN/Mesos

I want to schedule some Spark jobs at specified time intervals. Every scheduler that I found works only with YARN/Mesos (e.g. Oozie, Luigi, Azkaban, Airflow). I'm running DataStax and it doesn't have the option of running with YARN or Mesos. I saw somewhere that Oozie might work with DataStax but couldn't find any help for that. Is there any solution to this problem, or is the only option to write a scheduler myself?
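If it does come down to writing it yourself, the core of a standalone interval scheduler is small. A bare-bones sketch that shells out to DSE's spark-submit every N minutes; the interval, class, and jar are placeholders:

import subprocess
import time

INTERVAL_S = 5 * 60  # assumption: run every five minutes

while True:
    started = time.time()
    # dse spark-submit targets the DataStax-managed cluster manager
    result = subprocess.run(["dse", "spark-submit", "--class", "com.example.MyApp", "my-app.jar"])
    if result.returncode != 0:
        print("Job failed with exit code", result.returncode)
    # Sleep out whatever is left of the interval
    time.sleep(max(0, INTERVAL_S - (time.time() - started)))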
