is there a way to execute the spark code in a zeppelin notebook, without having to do it interactively? I'm looking for something specific or if anyone could point me in the correct direction. Or alternatively, other ways to submit spark code, which is currently in a zeppelin notebook. The reason I can't use spark-submit is that there is no command line access due to security reasons.
Zeppelin provides REST API which, among other functions, can be used to run either individual paragraphs, either synchronously
http://[zeppelin-server]:[zeppelin-port]/api/notebook/run/[noteId]/[paragraphId]
or asynchronously
http://[zeppelin-server]:[zeppelin-port]/api/notebook/job/[noteId]/[paragraphId]
as well as whole notebook:
http://[zeppelin-server]:[zeppelin-port]/api/notebook/job/[noteId]
It is also possible to define CRON jobs, both from notebook itself and from the REST API.
Related
When writing an Application in Scala using Spark, when ran, is it a regular Scala application which "delegates the spark jobs to the spark cluster" and gets the desired results back ?
Or does it get completely compiled to something special consumed by a "spark engine" ?
It depends on the "deploy mode"
If you use local mode, then all actions occur locally, and you don't get any benefits from distribution that Spark is meant for. While it can be used to abstract different libraries and provide clean ways to process data via dataframes or ML, it's not really intended to be used like that
Instead, you can use cluster mode, in which your app just defines the steps to take, and then when submitted, everything happens in the remote cluster. In order to process data back in the driver, you need to use methods such as collect(), or otherwise download the results from remote file systems/databases
I develop spark sql to run against hadoop. Today I must run a spark job that invokes my query. Is there another way to do this? I find I spend too much time fixing side issues with running the jobs in spark. Ideally I want to be able to compose and execute Spark SQL queries directly against hadoop/hbase and bypass the spark job altogether. This would permit much more rapid iteration when debugging or trying alternate queries.
Note that my queries are often 100 lines long or more so working from the command line is challenging.
I have to do this from a WIndows workstation
The best thing you can use for HBase is to use Apache Phoenix. It provides an SQL interface.
As an example, on my last project I used NIFI with Phoenix to read and mutate HBase data. Worked great from command line. I did discover a bug in my usage of it.
See https://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html. You van use an SQL file. In addition you can use Hue.
Never tried the following for Windows, but it is possible. See https://community.cloudera.com/t5/Community-Articles/How-to-connect-a-Windows-JDBC-client-to-Cluster-enabled-with/ta-p/247787
I'm new to ETL development with PySpark and I've been writing my scripts as paragraphs on Apache Zeppelin Notebooks. I was curious what the typical flow was for a deployment process? How are you converting your code from a Zeppelin Notebook to your ETL pipeline?
Thanks!
Well that heavily depends on the sort of ETL that you're doing.
If you want to keep the scripts in the notebooks and you just need to orchestrate their execution then you have a couple options:
Use Zeppelin's built-in scheduler
Use cron to launch your notebooks via curl commands and Zeppelin's REST API
But if you already have an up-and-running workflow management tool like Apache Airflow then you can add new tasks that launch the aforementioned curl commands to trigger the notebooks (with Airflow, you can use BashOperator or PythonOperator), but keep in mind that you'll need some workarounds to have a sequential execution of different notes.
One major tech company that's betting heavily on notebooks is Netflix (you can take a look at this), and they developed a set of tools to improve the effeciency of notebook-based ETL pipelines, like Commuter and Papermill. They're more into Jupyter, so Zeppelin compatibility is still not provided, but the core concepts should be the same when working with Zeppelin.
For more on Netflix' notebook-based pipelines, you can refer to this article shared on their tech blog.
My team is trying to transition from Zeppelin to Jupyter for an application we've built, because Jupyter seems to have more momentum, more opportunities for customization, and be generally more flexible. However, there are a couple of things Zeppelin we haven't been able to equivalents for in Jupyter.
The main one is to have multi-lingual Spark support - is it possible in Jupyter to create a Spark data frame that's accessible via R, Scala, Python, and SQL, all within the same notebook? We've written a Scala Spark library to create data frames and hand them back to the user, and the user may want to use various languages to manipulate/interrogate the data frame once they get their hands on it.
Is Livy a solution to this in the Jupyter context, i.e. will it allow multiple connections (from the various language front-ends) to a common Spark back-end so they can manipulate the same data objects? I can't quite tell from Livy's web site whether a given connection only supports one language, or whether each session can have multiple connections to it.
If Livy isn't a good solution, can BeakerX fill this need? The BeakerX website says two of its main selling points are:
Polyglot magics and autotranslation, allowing you to access multiple languages in the same notebook, and seamlessly communicate between them;
Apache Spark integration including GUI configuration, status, progress, interrupt, and tables;
However, we haven't been able to use BeakerX to connect to anything other than a local Spark cluster, so we've been unable to verify how the polyglot implementation actually works. If we can get a connection to a Yarn cluster (e.g. an EMR cluster in AWS), would the polyglot support give us access to the same session using different languages?
Finally, if neither of those work, would a custom Magic work? Maybe something that would proxy requests through to other kernels, e.g. spark and pyspark and sparkr kernels? The problem I see with this approach is that I think each of those back-end kernels would have their own Spark context, but is there a way around that I'm not thinking of?
(I know SO questions aren't supposed to ask for opinions or recommendations, so what I'm really asking for here is whether a possible path to success actually exists for the three alternatives above, not necessarily which of them I should choose.)
Another possible is the SoS (Script of Scripts) polyglot notebook https://vatlab.github.io/sos-docs/index.html#documentation.
It supports multiple Jupyter kernels in one notebook. SoS has several natively supported languages (R, Ruby, Python 2 & 3, Matlab, SAS, etc). Scala is not supported natively, but it's possible to pass information to the Scala kernel and capture output. There's also a seemingly straightforward way to add a new language (already with a Jupyter kernel); see https://vatlab.github.io/sos-docs/doc/documentation/Language_Module.html
I am using Livy in my application. The way it works is any user can connect to a already established spark session using REST (asynchronous calls). We have a cluster on which Livy sends Scala code for execution. It is up to you whether you want to close the session after sending the scala code or not. If the session is open then any one having access can send Scala code once again to do further processing. I have not tried sending different languages in the same session created through Livy but I know that Livy supports 3 languages in interactive mode i.e. R, Python and Scala. So, theoretically you would be able to send code in any language for execution.
Hope it helps to some extent.
I want to submit spark python applications from my laptop. I have a standalone spark cluster, and the master is running at some visible IP (MASTER_IP). After downloading and unzipping Spark on my laptop, I got this to work
./bin/spark-submit --master spark://MASTER_IP:7077 ~/PATHTO/pi.py
From what I understand, it is defaulting to client mode (vs cluster mode). According to Spark (http://spark.apache.org/docs/latest/submitting-applications.html) -
"only YARN supports cluster mode for Python applications." Since I'm not using YARN, I must use client mode.
My question is - do I need to download all of Spark on my laptop? Or just a few libraries?
I want to allow the rest of my team to use my Spark cluster, but I want them to do the least amount of work as possible. They don't need to setup a cluster. They only need to submit jobs to it. Having them downloading all of Spark seems like overkill.
So, what exactly is the minimum that they need?
The spark-1.5.0-bin-hadoop2.6 package I have here is 304MB unpacked. More than half, 175MB is made up of spark-assembly-1.5.0-hadoop2.6.0.jar, the main Spark stuff. You can't get rid of this unless you want to compile your own package maybe. A large part of the rest is spark-examples-1.5.0-hadoop2.6.0.jar, 113MB. Removing this and zipping back up is harmless and saves you a lot already.
However, using some tools such that they don't have to work with the spark package directly, like spark-jobserver (never used but never heard somebody very positive about the current state) or spark-kernel (needs your own code still to interface with it, or when used with notebook (see below) limited compared to alternatives) as suggested by Reactormonk makes it even easier for them.
A popular thing to do in that sense is set up access to a notebook. As you're using Python, IPython with a PySpark profile would be most straightforward to set up. Other alternatives are Zeppelin and spark-notebook (my favourite) for using Scala.