Running a Spark application as a scheduled job - apache-spark

We have written a program that fetches data from different sources, applies transformations, and writes the modified data into a MySQL database. The program uses Apache Spark for the ETL process, via the Spark Java API. We will be deploying the live application on YARN or Kubernetes.
I need to run the program as a scheduled job, say at an interval of five minutes. I did some research in blogs and articles and found several suggestions for scheduling a Spark application, such as a plain cron job, AWS Glue, and Apache Airflow. From my reading, it seems I can't run my code (Spark Java API) with AWS Glue, as it supports only Python and Scala.
Can someone provide insights or suggestions on this? Which is the best option for running a Spark application (on Kubernetes or YARN) as a scheduled job?
Is there an option for this in Amazon EMR? Thanks in advance.

The best option, I think, and the one I have used before, is a cron job, either:
From inside your container with crontab -e, with logging set up in case of failure, such as:
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/spark/bin
40 13 * * * . /PERSIST_DATA/02_SparkPi_Test_Spark_Submit.sh > /PERSIST_DATA/work-dir/cron_logging/cronSpark 2>&1
Or with a Kubernetes CronJob; see here for the different settings:
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
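
The linked page shows CronJob manifests in YAML; purely as an illustration of the same idea, here is a minimal sketch that creates an equivalent CronJob from Python with the official kubernetes client (the image name, script path, and five-minute schedule are assumptions taken from the question, not part of this answer):

# Minimal sketch, assuming a Spark image that contains the submit script used above.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="spark-submit",
    image="my-registry/spark-etl:latest",  # assumption: your Spark image
    command=["/bin/bash", "/PERSIST_DATA/02_SparkPi_Test_Spark_Submit.sh"],
)
job_template = client.V1JobTemplateSpec(
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
    )
)
cronjob = client.V1CronJob(
    metadata=client.V1ObjectMeta(name="spark-etl"),
    spec=client.V1CronJobSpec(schedule="*/5 * * * *", job_template=job_template),
)
client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cronjob)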

Related

Migrating nodejs jobs to Airflow

I am looking at migrating several Node.js jobs to Apache Airflow.
These jobs log to standard output. I am new to Airflow and have set it up running in Docker. Ideally, we would update these jobs to use connections provided by Airflow, but I'm not sure that will be possible.
We have succeeded in running the job by installing Node.js and invoking it from a BashOperator:
t1 = BashOperator(
    task_id='task_1',
    bash_command='/usr/bin/nodejs /usr/local/airflow/dags/test.js',
    dag=dag)
Would this be a good approach? Or would writing a Node.js operator be a better approach?
I also thought of putting the Node.js code behind an HTTP service, which would be my preferred approach, but then we lose the logs.
Any thoughts on how best to architect this in Airflow?
The bash approach is feasible, but it is going to be very hard to maintain the Node.js dependencies.
I would migrate the code to containers and use docker_operator / KubernetesPodOperator afterwards.
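
For illustration, a DAG along those lines might look like the sketch below (the image name, script path, and schedule are assumptions, not something from the original answer):

# Minimal sketch: run the Node.js script from its own image with DockerOperator,
# so its dependencies live in the container image rather than on the Airflow workers.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="node_jobs",                        # assumed DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",               # assumed schedule
    catchup=False,
) as dag:
    t1 = DockerOperator(
        task_id="task_1",
        image="my-registry/node-jobs:latest",  # assumed image bundling test.js and its deps
        command="node /app/test.js",
    )

The container's standard output is streamed into the Airflow task log, so the logging concern from the question is covered without an extra HTTP service.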

Apache Nifi - Submitting Spark batch jobs through Apache Livy

I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in its property or in the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi, and also take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can use the GenerateFlowFile processor as the scheduled trigger (on the primary node, so it won't be scheduled twice - unless you want it to be), then connect it to the ExecuteProcess processor and make it run the spark-submit command.
For a slightly more complex workflow, I've written an article about it :)
Hope it helps.
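
If you do end up driving Livy directly rather than through a processor, note that a batch submission is just an HTTP call to Livy's /batches endpoint; a minimal sketch (host, jar path, and class name are assumptions):

# Minimal sketch: submit a Spark batch job to Livy's REST API and print the batch id.
import requests

livy_url = "http://livy-host:8998/batches"        # assumption: your Livy endpoint
payload = {
    "file": "hdfs:///jobs/my-etl-assembly.jar",   # assumption: the job jar
    "className": "com.example.EtlJob",            # assumption: the main class
    "args": ["--date", "2024-01-01"],
}
resp = requests.post(livy_url, json=payload)
resp.raise_for_status()
print("Submitted batch:", resp.json()["id"])      # poll /batches/<id> for its state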

How to know remotely if a Spark job is running on a cluster

I am running a Spark job on an EC2 cluster, and I have a trigger that submits jobs periodically. I do not want to submit a job if one is already running on the cluster. Is there any API that can give me this information?
Spark, and by extension Spark Streaming, offers an operational REST API at http://<host>:4040/api/v1
Consulting the status of the current application will give you the information you are looking for.
Check the documentation: https://spark.apache.org/docs/2.1.0/monitoring.html#rest-api
You can also consult the UI to see the status. For example, if you run locally, take a look at localhost:4040.
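
As a concrete illustration of the REST API answer above, here is a small sketch that checks whether any application is currently running (host and port are assumptions; on YARN you would go through the application proxy or the history server instead of port 4040):

# Minimal sketch: decide whether to skip submission based on the Spark REST API.
import requests

def spark_app_running(host="ec2-driver-host", port=4040):
    try:
        apps = requests.get(f"http://{host}:{port}/api/v1/applications", timeout=5).json()
    except requests.ConnectionError:
        # Port 4040 is only served while a driver is alive, so a refused
        # connection usually means nothing is running.
        return False
    # An application counts as running if any of its attempts has not completed yet.
    return any(not attempt.get("completed", True)
               for app in apps
               for attempt in app.get("attempts", []))

if spark_app_running():
    print("A Spark job is already running; skipping submission.")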

Running Spark Streaming forever in production

I am developing a Spark Streaming application which basically reads data off Kafka and periodically saves it to HDFS.
I am running PySpark on YARN.
My question is more for production purpose. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this Spark Streaming application (in Python) to a client: what would you do to keep it running forever? You wouldn't just hand over the file and say "run this in the terminal"; it's too unprofessional.
What I want to do is submit the job to the cluster (or to local processors) without having to watch logs on the console or resort to something like Linux screen to run it in the background (which also seems too unprofessional).
What is the most professional and efficient way to permanently submit a Spark Streaming job to the cluster?
I hope I was unambiguous. Thanks!
You could use spark-jobserver, which provides a REST interface for uploading your jar and running it. You can find the documentation in the spark-jobserver repository.
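
For illustration, the two REST calls involved look roughly like this sketch (host, jar name, and class name are assumptions; see the spark-jobserver docs for the authoritative API):

# Minimal sketch: upload the application jar and start the job via spark-jobserver's REST API.
import requests

jobserver = "http://jobserver-host:8090"  # assumption: where spark-jobserver is running

# 1. Upload the assembly jar under an application name.
with open("streaming-assembly.jar", "rb") as jar:
    requests.post(f"{jobserver}/jars/streaming-app", data=jar).raise_for_status()

# 2. Start the job; classPath points at a class implementing the SparkJob trait.
resp = requests.post(
    f"{jobserver}/jobs",
    params={"appName": "streaming-app", "classPath": "com.example.StreamingJob"},
)
resp.raise_for_status()
print(resp.json())  # contains a job id you can poll under /jobs/<id>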

Apache Spark application deployment best practices

I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
General ETL use case -
more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
Streaming use case -
realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note: I'm using Spark Standalone as the cluster manager, so no YARN or Mesos.)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a jar, and running the various tasks with spark-submit - which seems to be the way recommended in the Spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. Who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program.
creates a spark context programmatically to talk to the spark cluster
allowing users to kick off tasks through the http interface
using Quartz (for example) to manage scheduling
could use cluster with zookeeper election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to job server anyway
no scheduling built in as far as I can see
I'd like to understand the general consensus w.r.t. a simple but robust deployment strategy - I haven't been able to determine one by trawling the web as of yet.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at:
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos; e.g., you could just use Chronos to trigger the spark-submit.
I hope I understood your problem correctly and this helps you a bit!
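
For illustration, registering a recurring spark-submit with Chronos is a single call to its REST API; a minimal sketch (host, port, paths, and the repeat interval are assumptions):

# Minimal sketch: create a Chronos job that repeats forever, every five minutes (ISO 8601 schedule).
import requests

job = {
    "name": "spark-etl",
    "command": "/opt/spark/bin/spark-submit --class com.example.EtlJob /jobs/etl.jar",  # assumption: your submit command
    "schedule": "R/2024-01-01T00:00:00Z/PT5M",  # R = repeat forever, PT5M = every 5 minutes
    "owner": "data-team@example.com",
    "cpus": 1,
    "mem": 2048,
}
resp = requests.post("http://chronos-host:4400/scheduler/iso8601", json=job)  # assumption: Chronos endpoint
resp.raise_for_status()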
