I'm new to ETL development with PySpark and I've been writing my scripts as paragraphs in Apache Zeppelin notebooks. I'm curious what the typical flow is for a deployment process. How are you converting your code from a Zeppelin notebook to your ETL pipeline?
Thanks!
Well, that heavily depends on the sort of ETL you're doing.
If you want to keep the scripts in the notebooks and you just need to orchestrate their execution, then you have a couple of options:
Use Zeppelin's built-in scheduler
Use cron to launch your notebooks via curl commands and Zeppelin's REST API
But if you already have an up-and-running workflow management tool like Apache Airflow, then you can add new tasks that launch the aforementioned curl commands to trigger the notebooks (with Airflow, you can use the BashOperator or PythonOperator). Keep in mind that you'll need some workarounds to get sequential execution of different notes.
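For illustration, here's a minimal sketch of that pattern as an Airflow DAG. The Zeppelin host, note ID, and schedule are placeholders, the import path assumes an Airflow 1.x-style install, and the REST endpoint should be double-checked against your Zeppelin version's docs.

```python
# Minimal sketch, not a drop-in solution: the host, note ID, and schedule below
# are placeholders, and the import path assumes an Airflow 1.x-style install.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

ZEPPELIN_HOST = "http://zeppelin-server:8080"  # hypothetical Zeppelin server
NOTE_ID = "2ABCDEFGH"                          # hypothetical note ID

dag = DAG(
    dag_id="zeppelin_etl",
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",  # nightly at 02:00
    catchup=False,
)

# POST /api/notebook/job/{noteId} asks Zeppelin to run all paragraphs of a note
run_note = BashOperator(
    task_id="run_zeppelin_note",
    bash_command="curl -X POST {}/api/notebook/job/{}".format(ZEPPELIN_HOST, NOTE_ID),
    dag=dag,
)
```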
One major tech company that's betting heavily on notebooks is Netflix (you can take a look at this), and they developed a set of tools to improve the efficiency of notebook-based ETL pipelines, like Commuter and Papermill. They're more into Jupyter, so Zeppelin compatibility is not provided yet, but the core concepts should be the same when working with Zeppelin.
For more on Netflix's notebook-based pipelines, you can refer to this article shared on their tech blog.
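To give a small taste of what that looks like (Jupyter-side, since Papermill doesn't target Zeppelin), a run can be as simple as the following; the notebook paths and the run_date parameter are made up for the example:

```python
# Minimal Papermill sketch: execute a parameterized Jupyter notebook and keep
# the executed copy as an artifact of the run. Paths/parameters are placeholders.
import papermill as pm

pm.execute_notebook(
    "etl_template.ipynb",          # source notebook with a "parameters" cell
    "runs/etl_2019-01-01.ipynb",   # output notebook, one per run
    parameters={"run_date": "2019-01-01"},
)
```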
Related
A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. Our company's intent is to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is, of course, the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark and have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. Just as when you run spark-shell, execute some commands, and then use collect() to bring all the data back to be shown in the local environment, that all runs in a single JVM. The web service would request executors on a remote Spark cluster, run the job there, and then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not suggested for the large datasets that Spark is meant for. Depending on the data that you need (e.g. if you're mostly using SparkSQL), it might be better to query a database directly.
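To make the Livy option concrete, a batch submission is roughly one REST call. The Livy URL and application path below are placeholders, and the payload fields should be checked against your Livy version:

```python
# Rough sketch of submitting a PySpark batch job through Apache Livy's REST API.
# The Livy endpoint and the application path are placeholders.
import requests

LIVY_URL = "http://livy-server:8998"  # hypothetical Livy endpoint

resp = requests.post(
    "{}/batches".format(LIVY_URL),
    json={
        "file": "hdfs:///jobs/extract_job.py",    # hypothetical PySpark script
        "args": ["--source", "events"],           # arguments for the script
        "conf": {"spark.executor.memory": "2g"},  # extra Spark configuration
    },
)
batch = resp.json()
print("Submitted batch", batch.get("id"), "state:", batch.get("state"))
```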
I have been building Python pipelines using Google Cloud Dataflow and Apache Beam for about a year. I am leaving the Google Cloud environment for a university cluster, which has Spark installed. It looks like the Spark runner is only for Java (https://beam.apache.org/documentation/runners/spark/)? Are there any suggestions on how to run Python Apache Beam pipelines outside of Cloud Dataflow?
As of right now, this is not yet possible, but portability across runners and languages is the highest priority and the most active area of development in Beam right now. I think the portable Flink runner is very close to being able to run simple pipelines in Python, with portable Spark runner development to commence soon (and share lots of code with Flink). Stay tuned and follow the dev@ mailing list!
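For context, the kind of "simple pipeline in Python" those portable runners aim to support looks like the sketch below; it runs locally on the DirectRunner today, and the file paths are placeholders:

```python
# Minimal Beam Python pipeline sketch; runs on the DirectRunner locally today,
# with the idea that the same code would later target a portable Flink/Spark runner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # e.g. pass --runner=DirectRunner for a local run

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input.txt")     # placeholder input path
        | "LineLengths" >> beam.Map(len)
        | "Write" >> beam.io.WriteToText("line_lengths")  # placeholder output prefix
    )
```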
I'm new to workflow engines and I need to fork a couple of my jobs, so I thought of using Apache Oozie for the purpose. I use Spark Standalone as my cluster manager.
But most of the documents I've gone through talk only about Oozie on YARN. My question: is an Oozie workflow supported on Spark Standalone, and is it recommended?
If so, can you share an example? Alternatively, I would also like to know the possibilities of doing a fork in Spark without using any workflow engine. What is the industry-standard way of scheduling jobs apart from cron?
I am using Apache Spark in Bluemix.
I want to implement a scheduler for SparkSQL jobs. I saw this link to a blog post that describes scheduling, but it's not clear how I would update the manifest. Maybe there is some other way to schedule my jobs.
The manifest file is there to guide the deployment of Cloud Foundry (cf) apps. So in your case, it sounds like you want to deploy a cf app that acts as a SparkSQL scheduler and use the manifest file to declare that your app doesn't need any of the web-app routing stuff, or anything else for user-facing apps, because you just want to run a background scheduler. This is all well and good, and the cf docs will help you make that happen.
However, you cannot run a SparkSQL scheduler against the Bluemix Spark service today, because it only supports Jupyter notebooks through the Data-Analytics section of Bluemix; i.e., only a notebook UI. You'd need a Spark API you could drive from your scheduler cf app, e.g. something spark-submit-like where you can create your Spark context and then run programs, like the SparkSQL you mention. This API is supposed to be coming to the Apache Spark Bluemix service.
UPDATE: spark-submit was made available sometime around the end of 1Q16. It is a shell script, but internally it makes REST calls via curl. The REST API itself doesn't yet seem to be supported, so you could either call the script from your scheduler, or take the risk of calling the REST API directly and hope it doesn't change and break you.
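If you go the "call the script from your scheduler" route, the cf app can be as simple as a loop around a subprocess call. This is only a sketch: the script path, its arguments, and the interval are placeholders, and the real spark-submit.sh argument list comes from the service docs:

```python
# Rough sketch of a background scheduler cf app that shells out to the
# spark-submit.sh wrapper. The script path, its arguments, and the interval
# are placeholders; consult the service docs for the real argument list.
import subprocess
import time

SUBMIT_SCRIPT = "./spark-submit.sh"   # the wrapper script mentioned above
SUBMIT_ARGS = ["my_sparksql_job.py"]  # hypothetical job; real args per the docs
INTERVAL_SECONDS = 60 * 60            # run hourly

while True:
    result = subprocess.run(
        [SUBMIT_SCRIPT] + SUBMIT_ARGS,
        capture_output=True,
        text=True,
    )
    print("spark-submit exited with", result.returncode)
    print(result.stdout)
    time.sleep(INTERVAL_SECONDS)
```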
I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
(1) General ETL use case - more specifically, a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
(2) Streaming use case - realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note - I'm using Spark Standalone as the cluster manager, so no YARN or Mesos.)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a jar, and running the various tasks with spark-submit - which seems to be the way recommended in the spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. Who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program.
creates a SparkContext programmatically to talk to the Spark cluster
allows users to kick off tasks through an HTTP interface
uses Quartz (for example) to manage scheduling
could use a cluster with ZooKeeper leader election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and I would still need some app to talk to the job server anyway
no scheduling built in, as far as I can see
I'd like to understand the general consensus w.r.t a simple but robust deployment strategy - I haven't been able to determine one by trawling the web, as of yet.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at:
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos; e.g., you could just use Chronos to trigger the spark-submit.
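For example, registering a scheduled spark-submit with Chronos is roughly one REST call. The Chronos URL, schedule, and command line below are placeholders, and the endpoint/fields should be checked against your Chronos version:

```python
# Rough sketch of registering a scheduled spark-submit job with Chronos via its
# REST API. The Chronos URL, schedule, and command line are placeholders, and
# the endpoint/fields may differ between Chronos versions.
import requests

CHRONOS_URL = "http://chronos-host:4400"  # hypothetical Chronos endpoint

job = {
    "name": "nightly-cassandra-etl",
    "command": "/opt/spark/bin/spark-submit --master spark://master:7077 /jobs/etl.jar",
    "schedule": "R/2016-01-01T02:00:00Z/P1D",  # ISO 8601: repeat daily at 02:00 UTC
    "epsilon": "PT30M",                        # how late the job is allowed to start
    "owner": "etl-team@example.com",
}

resp = requests.post("{}/scheduler/iso8601".format(CHRONOS_URL), json=job)
print("Chronos responded with HTTP", resp.status_code)
```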
I hope I understood your problem correctly and this helps you a bit!