A big project on Apache Zeppelin - apache-spark

I have a streaming job that needs to be launched through Zeppelin. However, it is a very big project. As far as I know, programs launched on Zeppelin are notebook-style, with several hundred lines of code at most.
My code has several thousand lines, with many classes and objects. How can I launch such a big project on Zeppelin?
For a particular requirement, I have to do this...

The correct and supported way to do this is to extract an API from your code, compile it as a jar library, and use Zeppelin's dependency handling (see the interpreter settings) to add the jar of your existing project to Zeppelin. Then you can call the complex methods of your project from within Zeppelin, using compact, notebook-friendly bits of code.
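As a rough sketch of what calling into such a jar can look like in a notebook paragraph (the package, class, and method names below are made up for illustration, and the jar is assumed to have been added through the interpreter's dependency settings):

// Zeppelin's %spark interpreter already exposes a SparkSession as `spark`.
// The heavy streaming logic lives in the jar; the notebook only wires it up.
import com.example.streaming.StreamingJob   // hypothetical class from your jar

val job = new StreamingJob(spark, "/path/to/job.conf")   // hypothetical constructor
val query = job.start()                                  // assumed to return the running query
// query.awaitTermination()   // optionally block the paragraph until the stream stops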

Related

Importing vs Installing spark

I am new to the Spark world and, to some extent, to coding.
This question might seem too basic, but please clear up my confusion.
I know that we have to import the Spark libraries to write a Spark application. I use IntelliJ and sbt.
After writing the application, I can also run it and see the output with "Run".
My question is: why should I install Spark separately on my (local) machine if I can just import it as a library and run it?
Also, what is the need for it to be installed on the cluster, since we can just submit the jar file and a JVM is already present on all the machines of the cluster?
Thank you for the help!
I understand your confusion.
Actually, you don't really need to install Spark on your machine if you are, for example, working in Scala/Java: you can just import spark-core or any other dependencies into your project, and once you start your Spark job from the main class it will create a standalone Spark runner on your machine and run the job there (local[*]).
There are many reasons for having Spark on your local machine.
One of them is running Spark jobs with PySpark, which requires the Spark/Python/etc. libraries and a runner (local[*] or a remote master).
Another reason can be if you want to run your job on-premise.
It might be easier to create a cluster in your local datacenter and appoint your machine as the master, with the other machines connected to it as workers. (This solution might be a bit naive, but you asked for basics, so it might spark your curiosity to read more about the infrastructure design of a data processing system.)
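For the first point, here is a minimal sketch of what "just importing the library" looks like, assuming spark-sql is declared as an sbt dependency (the object name and the toy data are made up):

import org.apache.spark.sql.SparkSession

object LocalDemo {
  def main(args: Array[String]): Unit = {
    // No separate Spark installation: the spark-sql dependency pulled in by
    // sbt spins up an embedded local runner when the master is local[*].
    val spark = SparkSession.builder()
      .appName("local-demo")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._
    Seq("a", "b", "a").toDF("letter").groupBy("letter").count().show()

    spark.stop()
  }
}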

Jupyter as Zeppelin replacement: multi-lingual Spark

My team is trying to transition from Zeppelin to Jupyter for an application we've built, because Jupyter seems to have more momentum, more opportunities for customization, and to be generally more flexible. However, there are a couple of things in Zeppelin we haven't been able to find equivalents for in Jupyter.
The main one is to have multi-lingual Spark support - is it possible in Jupyter to create a Spark data frame that's accessible via R, Scala, Python, and SQL, all within the same notebook? We've written a Scala Spark library to create data frames and hand them back to the user, and the user may want to use various languages to manipulate/interrogate the data frame once they get their hands on it.
Is Livy a solution to this in the Jupyter context, i.e. will it allow multiple connections (from the various language front-ends) to a common Spark back-end so they can manipulate the same data objects? I can't quite tell from Livy's web site whether a given connection only supports one language, or whether each session can have multiple connections to it.
If Livy isn't a good solution, can BeakerX fill this need? The BeakerX website says two of its main selling points are:
Polyglot magics and autotranslation, allowing you to access multiple languages in the same notebook, and seamlessly communicate between them;
Apache Spark integration including GUI configuration, status, progress, interrupt, and tables;
However, we haven't been able to use BeakerX to connect to anything other than a local Spark cluster, so we've been unable to verify how the polyglot implementation actually works. If we can get a connection to a Yarn cluster (e.g. an EMR cluster in AWS), would the polyglot support give us access to the same session using different languages?
Finally, if neither of those work, would a custom Magic work? Maybe something that would proxy requests through to other kernels, e.g. spark and pyspark and sparkr kernels? The problem I see with this approach is that I think each of those back-end kernels would have their own Spark context, but is there a way around that I'm not thinking of?
(I know SO questions aren't supposed to ask for opinions or recommendations, so what I'm really asking for here is whether a possible path to success actually exists for the three alternatives above, not necessarily which of them I should choose.)
Another possibility is the SoS (Script of Scripts) polyglot notebook: https://vatlab.github.io/sos-docs/index.html#documentation.
It supports multiple Jupyter kernels in one notebook. SoS has several natively supported languages (R, Ruby, Python 2 & 3, MATLAB, SAS, etc.). Scala is not supported natively, but it is possible to pass information to the Scala kernel and capture its output. There is also a seemingly straightforward way to add a new language (one that already has a Jupyter kernel); see https://vatlab.github.io/sos-docs/doc/documentation/Language_Module.html
I am using Livy in my application. The way it works is that any user can connect to an already established Spark session using REST (asynchronous calls). We have a cluster to which Livy sends Scala code for execution. It is up to you whether you want to close the session after sending the Scala code or not. If the session is left open, then anyone with access can send Scala code once again to do further processing. I have not tried sending different languages in the same session created through Livy, but I know that Livy supports three languages in interactive mode, i.e. R, Python, and Scala. So, theoretically, you would be able to send code in any of those languages for execution.
Hope it helps to some extent.
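As a rough illustration of that workflow, the sketch below creates an interactive session over Livy's REST API and sends it statements. It assumes a Livy endpoint at livy-host:8998, hard-codes session id 0 instead of parsing it from the create response, and relies on newer Livy versions (0.5+) accepting a per-statement "kind" so Scala and Python statements can share one session; verify that against the Livy version you run.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivyPolyglotSketch {
  val livyUrl = "http://livy-host:8998"   // assumed Livy endpoint

  def post(path: String, json: String): String = {
    val request = HttpRequest.newBuilder()
      .uri(URI.create(livyUrl + path))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // "shared" sessions (Livy 0.5+) accept statements of different kinds.
    post("/sessions", """{"kind": "shared"}""")

    // Register a view from Scala ... (session id 0 is hard-coded here;
    // real code would parse the id from the create-session response)
    post("/sessions/0/statements",
      """{"kind": "spark", "code": "spark.range(10).createOrReplaceTempView(\"t\")"}""")

    // ... then read the same view from Python in the same session.
    post("/sessions/0/statements",
      """{"kind": "pyspark", "code": "spark.table('t').count()"}""")
  }
}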

How to modify spark source code and build

I have just started learning Spark. I have imported the Spark source code into IDEA and made some small changes (just adding some println() calls). What should I do to see these updates? Should I recompile Spark? Thanks!
At the bare minimum, you will need maven 3.3.3 and Java 7+.
You can follow the steps at http://spark.apache.org/docs/latest/building-spark.html
The "make-distribution.sh" script is quite handy which comes within the spark source code root directory. This script will produce a distributable tar.gz which you can simply extract and launch spark-shell or spark-submit. After making the source code changes in spark, you can run this script with the right options (mainly passing the desired hadoop version, yarn or hive support options but these are required if you want to run on top of hadoop distro, or want to connect to existing hive).
BTW, inserting println() will not be a good idea as it can severely slow down the performance of the job. You should use a logger instead.
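For example, inside Spark's own source tree most classes mix in Spark's internal Logging trait rather than printing to stdout. A minimal sketch, assuming a Spark 2.x+ checkout where the trait lives at org.apache.spark.internal.Logging (the class below is made up):

import org.apache.spark.internal.Logging

// Hypothetical component inside the Spark source tree.
class MyInstrumentedComponent extends Logging {
  def doWork(): Unit = {
    // Goes through log4j, so it can be filtered by level and logger name
    // instead of always hitting stdout the way println does.
    logInfo("doWork() was called")
    logDebug("extra detail, only emitted when DEBUG is enabled")
  }
}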

Submit spark application from laptop

I want to submit spark python applications from my laptop. I have a standalone spark cluster, and the master is running at some visible IP (MASTER_IP). After downloading and unzipping Spark on my laptop, I got this to work
./bin/spark-submit --master spark://MASTER_IP:7077 ~/PATHTO/pi.py
From what I understand, it is defaulting to client mode (vs cluster mode). According to Spark (http://spark.apache.org/docs/latest/submitting-applications.html) -
"only YARN supports cluster mode for Python applications." Since I'm not using YARN, I must use client mode.
My question is - do I need to download all of Spark on my laptop? Or just a few libraries?
I want to allow the rest of my team to use my Spark cluster, but I want them to do as little work as possible. They don't need to set up a cluster; they only need to submit jobs to it. Having them download all of Spark seems like overkill.
So, what exactly is the minimum that they need?
The spark-1.5.0-bin-hadoop2.6 package I have here is 304MB unpacked. More than half of that, 175MB, is made up of spark-assembly-1.5.0-hadoop2.6.0.jar, the main Spark code. You can't get rid of this unless you want to compile your own package. A large part of the rest is spark-examples-1.5.0-hadoop2.6.0.jar, at 113MB. Removing this and zipping the package back up is harmless and already saves you a lot.
However, using some tools so that they don't have to work with the Spark package directly, as suggested by Reactormonk, makes it even easier for them: for example spark-jobserver (I have never used it, and have never heard anyone speak very positively about its current state) or spark-kernel (it still needs your own code to interface with it, or, when used with a notebook (see below), it is limited compared to the alternatives).
A popular thing to do in that sense is to set up access to a notebook. As you're using Python, IPython with a PySpark profile would be the most straightforward to set up. Other alternatives are Zeppelin and spark-notebook (my favourite) for using Scala.

excluding hadoop from spark build

I'm modifying the hdfs module inside Hadoop and would like to see the changes reflected while running Spark on top of it, but I still see the native Hadoop behaviour. I've checked and seen that Spark builds a really fat jar file, which contains all the Hadoop classes (using the hadoop profile defined in Maven), and deploys it to all workers. I also tried bigtop-dist to exclude the Hadoop classes, but saw no effect.
Is it possible to do such a thing easily, for example by small modifications inside the maven file?
I believe you are looking for the provided scope on Maven artifacts. It allows you to exclude certain classes from packaging while still letting you compile against them (with the expectation that your runtime environment will provide them at their correct respective versions). See here and here for further discussion.
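For illustration, here is what that idea looks like in an sbt build definition (in a Maven pom the equivalent is setting "<scope>provided</scope>" on the dependency); the artifact and version below are placeholders:

// build.sbt (sketch): compile against the Hadoop client API, but mark it
// "provided" so packaging/assembly leaves it out and the Hadoop classes
// already present on the cluster are picked up at runtime instead.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.7" % "provided"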
