Apache Spark app workflow - apache-spark

How do You organize the Spark development workflow?
My way:
Local hadoop/yarn service.
Local spark service.
Intellij on one screen
Terminal with running sbt console
After I change Spark app code, I switch to terminal and run "package" to compile to jar and "submitSpark" which is stb task that runs spark-submit
Wait for exception in sbt console :)
I also tried to work with spark-shell:
Run shell and load previously written app.
Write line in shell
Evaluate it
If it's fine copy to IDE
After few 2,3,4, paste code to IDE, compile spark app and start again
Is there any way to develop Spark apps faster?

I develop the core logic of our Spark jobs using an interactive environment for rapid prototyping. We use the Spark Notebook running against a development cluster for that purpose.
Once I've prototyped the logic and it's working as expected, I "industrialize" the code in a Scala project, with the classical build lifecycle: create tests; build, package and create artifacts by Jenkins.

I found writing scripts and using :load / :copy streamlined things a bit since I didn't need to package anything. If you do use sbt I suggest you start it and use ~ package such that it automatically packages the jar when changes are made. Eventually of course everything will end up in an application jar, this is for prototyping and exploring.
Local Spark
Vim
Spark-Shell
APIs
Console

We develop ours applications using an IDE (Intellij because we code your spark's applications in Scala) using scalaTest for testing.
In those tests we use local[*] as SparkMaster in order to allow the debugging.
For integration testing we used Jenkins and we launch an "end to end" script as an Scala application.
I hope this will be useful

Related

How to run Spark processes in develop environment using a cluster?

I'm implementing differents Apache Spark solutions using IntelliJ IDEA, Scala and SBT, however, each time that I want to run my implementation I need to do the next steps after creating the jar:
Amazon: To send the .jar to the master node using SSH, and then run
the command line spark-shell.
Azure: I'm using Databricks CLI, so each time that I want to upload a
jar, I uninstall the old library, remove the jar stored in the cluster,
and finally, I upload and install the new .jar.
So I was wondering if it is possible to do all these processes just in one click, using the IntelliJ IDEA RUN button for example, or using another method to make simpler all of it. Also, I was thinking about Jenkins as an alternative.
Basically, I'm looking for easier deployment options.

SBT console vs Spark-Shell for interactive development

I'm wondering if there are any important differences between using SBT console and Spark-shell for interactively developing new code for a Spark project (notebooks are not really an option w/ the server firewalls).
Both can import project dependencies, but for me SBT is a little more convenient. SBT automatically brings in all the dependencies in build.sbt and spark-shell can use the --jar, --packages, and --repositories arguments in the command line.
SBT has the handy initialCommands setting to automatically run lines at startup. I use this for initializing the SparkContext.
Are there others?
With SBT you need not install SPARK itself theoretically.
I use databricks.
From my experience sbt calls external jars innately spark shell calls series of imports and contexts innately. I prefer spark shell because it follows the standard you need to adhere to when build the spark submit session.
For running the code in production you need to build the code into jars, calling them via spark submit. To build that you need to package it via sbt (compilation check) and run the spark submit submit call (logic check).
You can develope using either tool but you should code as if you did not have the advantages of sbt (calling the jars) and spark shell (calling the imports and contexts) because spark submit doesn't do either.

Why does HDInsight cluster does not come with pre-installed Scala?

on HDInsight's masternode, $scala -verion returns an error. It is easily installed via
$apt-get install scala
but shouldn't scala be installed there by default?
Thank you for suggestion. What's the scenario where you need scala to be directly installed on the node? For example, in spark there are couple of other common scenarios that already work:
Running Spark commands in command line. This is accomplished through spark-shell which has built-in scala interpreter.
Building spark project. This is ussually done through maven or sbt project definition file. Those tools would automatically download correct scala version and compiler based on the project dependencies.
As you said it's not hard to preinstall scala, but we would like to understand the need to do that. In the discussions with customers this didn't come up before.

JavaSparkContext - jarOfClass or jarOfObject doesnt work

Hi I am trying to run my spark service against cluster. As it turns out I have to do setJars and set my applicaiton jar in there. If I do it using physical path like following it works
conf.setJars(new String[]{"/path/to/jar/Sample.jar"});
but If i try to use JavaSparkContext (or SparkContext) api jarOfClass or jarOfObject it doesnt work. Basically API cant find jar itself.
Following returns empty
JavaSparkContext.jarOfObject(this);
JavaSparkContext.jarOfClass(this.getClass())
Its an excellent API only if it worked! Any one else able to make use of this?
[I have included example for Scala. I am sure it will work the same way for Java]
It will work if you do :
SparkContext.jarOfObject(this.getClass)
Surprisingly, this works for Scala Object as well as Scala Class.
How are you running the app? If you are running it from an IDE or compilation tool such as sbt, then
the jar is not package when running,
if you have package it once before, then your /path/to/jar/Sample.jar exist, thus giving the hard coded path works, but this is not the .class that the jvm runing your app is using and it cannot find.

How to add the Breeze which is build by myself to Apache Spark?

I added some methods to the Breeze library and I can see those methods through IDE. And I was trying to add the Breeze library which is build by myself to my project which is based on Apache Spark. However, when I package my project by command "sbt assembly" and run it on my cluster, it throws an error "no such method xxx" which means the cluster actually didn't run my Breeze library. So could anyone tell me how to make the cluster run the Breeze library which is build by myself?
I have a guess that spark uses some version of Breeze libraries itself and prefer them over you custom .jars in assembly. You can try to build spark with your custom library. Install your library in your local maven repository, specify it in apache spark's pom.xml and build your own spark version.

Resources