I'm new to workflow engines and I need to fork a couple of my jobs, so I thought of using Apache Oozie for the purpose. I use Spark Standalone as my cluster manager.
Most of the documents I've gone through only talk about Oozie on YARN. My question: is an Oozie workflow supported (and recommended) for Spark Standalone?
If so, can you share an example? Alternatively, I would also like to know the possibilities of doing a fork
in Spark without using any workflow engine. What is the industry-standard way of scheduling jobs apart from cron?
I'm new to ETL development with PySpark and I've been writing my scripts as paragraphs in Apache Zeppelin notebooks. I'm curious what the typical flow is for a deployment process. How are you converting your code from a Zeppelin notebook into your ETL pipeline?
Thanks!
Well, that heavily depends on the sort of ETL you're doing.
If you want to keep the scripts in the notebooks and you just need to orchestrate their execution, then you have a couple of options:
Use Zeppelin's built-in scheduler
Use cron to launch your notebooks via curl commands and Zeppelin's REST API (see the sketch below)
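For the cron route, here is a minimal sketch of what the trigger could look like, written in Python rather than plain curl; the Zeppelin host, port, and note ID are placeholders you'd replace with your own:
# run_note.py - trigger all paragraphs of a Zeppelin note via its REST API
# (hypothetical host/port and note ID; adjust to your installation)
import sys
import requests

ZEPPELIN_URL = "http://localhost:8080"   # assumed Zeppelin endpoint
NOTE_ID = "2F1ABCDEF"                    # placeholder note ID

resp = requests.post(f"{ZEPPELIN_URL}/api/notebook/job/{NOTE_ID}")
resp.raise_for_status()                  # fail loudly so cron can surface errors
print(resp.json())
sys.exit(0)
A crontab entry would then simply invoke this script on whatever schedule you need.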
But if you already have an up-and-running workflow management tool like Apache Airflow, then you can add new tasks that launch the aforementioned curl commands to trigger the notebooks (with Airflow, you can use BashOperator or PythonOperator; a sketch follows below). Keep in mind that you'll need some workarounds to get sequential execution of different notes.
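To make the Airflow option concrete, here's a minimal sketch of a DAG that triggers a note through the same REST call; the operator import path varies between Airflow versions, and the host and note ID are made up:
# zeppelin_etl_dag.py - hypothetical Airflow DAG triggering a Zeppelin note
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

with DAG(dag_id="zeppelin_etl",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    run_note = BashOperator(
        task_id="run_note",
        # placeholder Zeppelin host/port and note ID
        bash_command="curl -X POST http://zeppelin-host:8080/api/notebook/job/2F1ABCDEF",
    )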
One major tech company that's betting heavily on notebooks is Netflix (you can take a look at this), and they have developed a set of tools to improve the efficiency of notebook-based ETL pipelines, like Commuter and Papermill. They're more into Jupyter, so Zeppelin compatibility is not provided yet, but the core concepts should be the same when working with Zeppelin.
For more on Netflix's notebook-based pipelines, you can refer to this article shared on their tech blog.
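As a taste of the Papermill side of that tooling (Jupyter-oriented, so it won't run Zeppelin notes directly), executing a parameterized notebook looks roughly like this; the notebook paths and parameters are invented for illustration:
# hypothetical example: run a parameterized Jupyter notebook with Papermill
import papermill as pm

pm.execute_notebook(
    "etl_template.ipynb",            # input notebook (placeholder)
    "etl_2019-01-01_output.ipynb",   # executed copy, kept as an artifact
    parameters={"run_date": "2019-01-01", "env": "prod"},
)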
I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in a property or taken from the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi, and also take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can use the GenerateFlowFile processor as the scheduled trigger (on the primary node so it won't be scheduled twice, unless you want it to be), connect it to the ExecuteProcess processor, and make that run the spark-submit command (a sketch of such a wrapper follows below).
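One way to get the success/failure branching you asked about is to have the processor call a small wrapper that exits non-zero when spark-submit fails, so the exit status can drive routing downstream (for example, ExecuteStreamCommand exposes a nonzero-status relationship). A rough Python sketch, with a made-up master URL, main class, and jar path:
# submit_job.py - hypothetical wrapper invoked from NiFi
import subprocess
import sys

cmd = [
    "spark-submit",
    "--master", "spark://spark-master:7077",   # placeholder master URL
    "--class", "com.example.BatchJob",         # placeholder main class
    "/opt/jobs/batch-job.jar",                 # placeholder jar
]

result = subprocess.run(cmd)
sys.exit(result.returncode)   # non-zero exit signals failure to NiFi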
For a slightly more complex workflow, I've written an article about it :)
Hope it helps.
I had pretty big expectations of Spark Job Server, but found that it critically lacks documentation.
Could you please answer one or all of the following questions:
Does Spark Job Server submit jobs through a Spark session?
Is it possible to run a few jobs in parallel with Spark Job Server? I saw that people faced some troubles, but I haven't seen a solution yet.
Is it possible to run a few jobs in parallel with different CPU, core, and executor configs?
Spark Jobserver does not support SparkSession yet. We will be working on it.
You can either create multiple contexts, or you can run a context that uses the FAIR scheduler.
Use different contexts with different resource configs.
Basically, Jobserver is just a REST API for creating Spark contexts, so you should be able to do whatever you could do with a Spark context (a quick sketch of the REST calls follows).
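As an illustration of that REST surface (endpoint and parameter names are as I recall them from the jobserver docs, so double-check against your version; the app and class names are placeholders):
# hypothetical sketch: drive spark-jobserver over its REST API
import requests

JOBSERVER = "http://jobserver-host:8090"   # placeholder host/port

# create a context with its own resource configuration
requests.post(
    f"{JOBSERVER}/contexts/etl-context",
    params={"num-cpu-cores": "4", "memory-per-node": "2g"},
)

# run a previously uploaded app's job class inside that context
resp = requests.post(
    f"{JOBSERVER}/jobs",
    params={
        "appName": "my-etl-app",            # placeholder binary name
        "classPath": "com.example.EtlJob",  # placeholder job class
        "context": "etl-context",
    },
)
print(resp.json())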
I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
General ETL use case -
more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
Streaming use case -
realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note: I'm using Spark Standalone as the cluster manager, so no YARN or Mesos.)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a jar, and running the various tasks with spark-submit - which seems to be the way recommended in the Spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. Who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program.
creates a spark context programmatically to talk to the spark cluster
allowing users to kick off tasks through the http interface
using Quartz (for example) to manage scheduling
could use cluster with zookeeper election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to job server anyway
no scheduling built in as far as I can see
I'd like to understand the general consensus w.r.t a simple but robust deployment strategy - I haven't been able to determine one by trawling the web, as of yet.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at:
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos; e.g., you could just use Chronos to trigger the spark-submit.
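To make the Chronos idea concrete, a job is just a JSON definition (an ISO 8601 schedule plus the command to run) posted to its REST API; the endpoint path and fields below are from memory and the spark-submit command is a placeholder:
# hypothetical sketch: register a scheduled spark-submit with Chronos
import requests

CHRONOS = "http://chronos-host:4400"   # placeholder Chronos endpoint

job = {
    "name": "nightly-etl",
    "owner": "you@example.com",
    # repeat every 24 hours, starting from the given instant
    "schedule": "R/2015-06-01T00:00:00Z/PT24H",
    "command": ("spark-submit --master spark://spark-master:7077 "
                "--class com.example.EtlJob /opt/jobs/etl.jar"),
}

resp = requests.post(f"{CHRONOS}/scheduler/iso8601", json=job)
resp.raise_for_status()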
I hope I understood your problem correctly and this helps you a bit!
I am very excited that HDInsight switched to Hadoop version 2, which supports Apache Spark through YARN. Apache Spark is a much better fitting parallel programming paradigm than MapReduce for the task that I want to perform.
I was unable to find any documentation, however, on how to do remote job submission of an Apache Spark job to my HDInsight cluster. For remote job submission of standard MapReduce jobs, I know there are several REST endpoints like Templeton and Oozie. But as far as I was able to find, running Spark jobs is not possible through Templeton. I did find it to be possible to incorporate Spark jobs into Oozie, but I've read that this is a very tedious thing to do, and I've also read some reports of job failure detection not working in this case.
Surely there must be a more appropriate way to submit Spark jobs. Does anyone know how to do remote job submission of Apache Spark jobs to HDInsight?
Many thanks in advance!
You can install Spark on an HDInsight cluster. You have to do it by creating a custom cluster and adding a script action that installs Spark on the cluster at the time the VMs for the cluster are created.
Installing with a script action at cluster-creation time is pretty easy; you can do it in C# or PowerShell by adding a few lines of code to a standard custom create-cluster script/program.
PowerShell:
# ADD SCRIPT ACTION TO CLUSTER CONFIGURATION
$config = Add-AzureHDInsightScriptAction -Config $config -Name "Install Spark" -ClusterRoleCollection HeadNode -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1
C#:
// ADD THE SCRIPT ACTION TO INSTALL SPARK
clusterInfo.ConfigActions.Add(new ScriptAction(
"Install Spark", // Name of the config action
new ClusterNodeType[] { ClusterNodeType.HeadNode }, // List of nodes to install Spark on
new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1"), // Location of the script to install Spark
null //because the script used does not require any parameters.
));
You can then RDP into the head node and use spark-shell, or use spark-submit to run jobs. I am not sure how you would run a Spark job without RDPing into the head node, but that is another question.
I also asked the same question of the Azure guys. The following is the solution from them:
"Two questions to the topic: 1. How can we submit a job outside of the cluster without "Remote to…" — Tao Li
Currently, this functionality is not supported. One workaround is to build a job submission web service yourself:
Create a Scala web service that will use the Spark APIs to start jobs on the cluster (a rough sketch of the idea follows this list).
Host this web service in a VM inside the same VNet as the cluster.
Expose the web service endpoint externally through some authentication scheme. You could also employ an intermediate MapReduce job, though it would take longer.
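The answer suggests a Scala service built on the Spark APIs; purely as an illustration of the shape of such a service, here is a minimal Python/Flask sketch that shells out to spark-submit instead (the port, paths, and class names are all made up):
# hypothetical sketch of a tiny job-submission web service
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/jobs", methods=["POST"])
def submit_job():
    body = request.get_json(force=True)
    cmd = [
        "spark-submit",
        "--class", body["main_class"],     # e.g. "com.example.EtlJob"
        body["jar_path"],                  # e.g. "/opt/jobs/etl.jar"
    ] + body.get("args", [])
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # report the exit code and the tail of stderr back to the caller
    return jsonify({"exit_code": proc.returncode, "stderr": proc.stderr[-2000:]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8998)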
You might consider using Brisk (https://brisk.elastatools.com) which offers Spark on Azure as a provisioned service (with support available). There's a free tier and it lets you access blob storage with a wasb://path/to/files just like HDInsight.
It doesn't sit on YARN; instead it is a lightweight, Azure-oriented distribution of Spark.
Disclaimer: I work on the project!
Best wishes,
Andy