SBT console vs Spark-Shell for interactive development - apache-spark

I'm wondering if there are any important differences between using SBT console and Spark-shell for interactively developing new code for a Spark project (notebooks are not really an option w/ the server firewalls).
Both can import project dependencies, but for me SBT is a little more convenient. SBT automatically brings in all the dependencies listed in build.sbt, while spark-shell can use the --jars, --packages, and --repositories command-line arguments.
SBT has the handy initialCommands setting to automatically run lines at startup. I use this for initializing the SparkContext.
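For reference, this is roughly how that looks in my build.sbt (the versions and settings here are only illustrative):

// build.sbt (excerpt) -- versions are placeholders
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.2"

// Runs automatically when "sbt console" starts, so spark and sc are ready to use.
console / initialCommands := """
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder().master("local[*]").appName("console").getOrCreate()
  val sc = spark.sparkContext
  import spark.implicits._
"""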
Are there others?

With SBT, you theoretically don't need to install Spark itself.
I use Databricks.

In my experience, sbt pulls in external jars automatically, while spark-shell sets up a series of imports and contexts for you. I prefer spark-shell because it follows the standard you need to adhere to when building the spark-submit session.
For running the code in production, you need to build the code into jars and call them via spark-submit. To get there, you package the code via sbt (compilation check) and run the spark-submit call (logic check).
You can develop using either tool, but you should code as if you did not have the advantages of sbt (calling the jars) or spark-shell (calling the imports and contexts), because spark-submit does neither.
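To make that concrete, here is a rough sketch of a job written without those conveniences, so it runs the same way under spark-submit (object name, paths, and logic are made up):

// All imports and the SparkSession are explicit; spark-submit pre-imports nothing.
import org.apache.spark.sql.SparkSession

object WordCountJob {                         // hypothetical job
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountJob")
      .getOrCreate()                          // master comes from spark-submit, not the code
    import spark.implicits._

    spark.read.textFile(args(0))              // input path passed on the command line
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()
      .show()

    spark.stop()
  }
}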

Related

Unable to build Spark application with multiple main classes for Databricks job

I have a Spark application that contains multiple Spark jobs to be run on Azure Databricks. I want to build and package the application into a fat jar. The application compiles successfully. When I try to package it (command: sbt package), it gives an error "[warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list".
How do I build the application jar (without specifying any main class) so that I can upload it to a Databricks job and specify the main class there?
This message is just a warning (see the [warn] in it); it doesn't prevent generation of the jar files (normal or fat). You can then upload the resulting jar to DBFS (or ADLS for newer Databricks Runtime versions) and create the Databricks job either as a Jar task or a Spark Submit task.
If sbt fails and doesn't produce jars, then you have some plugin that turns warnings into errors.
Also note that sbt package doesn't produce a fat jar - it produces a jar containing only the classes in your project. You will need sbt assembly (install the sbt-assembly plugin for that) to generate a fat jar, but make sure you mark the Spark and Delta dependencies as provided.
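A rough build.sbt sketch of that setup (versions are placeholders, and the sbt-assembly plugin itself is added in project/plugins.sbt):

// build.sbt (excerpt) -- Spark and Delta are "provided": available at compile time,
// but left out of the fat jar because the Databricks runtime already ships them.
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.3.2" % "provided",
  "io.delta"         %% "delta-core" % "2.3.0" % "provided"
)

// No main class is baked into the jar; the entry point is picked later
// in the Databricks job definition.
assembly / mainClass := None

// project/plugins.sbt (separate file):
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")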

How to run Spark processes in a development environment using a cluster?

I'm implementing different Apache Spark solutions using IntelliJ IDEA, Scala, and SBT; however, each time I want to run my implementation I need to do the following steps after creating the jar:
Amazon: send the .jar to the master node over SSH, and then run spark-shell from the command line.
Azure: I'm using the Databricks CLI, so each time I want to upload a jar, I uninstall the old library, remove the jar stored on the cluster, and finally upload and install the new .jar.
So I was wondering if it is possible to do all these processes in just one click, for example using the IntelliJ IDEA Run button, or with some other method that makes all of this simpler. I was also thinking about Jenkins as an alternative.
Basically, I'm looking for easier deployment options.
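One possible direction, purely as a sketch: wrap the upload in a custom sbt task, so a single sbt command (or an IntelliJ run configuration that invokes it) does the packaging and the copy. The CLI sub-command and flags below are assumptions about the legacy Databricks CLI, and the DBFS path is made up:

// build.sbt (excerpt) -- assumes sbt-assembly is installed; treat as a sketch, not a recipe.
import scala.sys.process._

lazy val deployToDatabricks = taskKey[Unit]("Assemble the fat jar and copy it to DBFS")

deployToDatabricks := {
  val jar = assembly.value                      // fat jar produced by sbt-assembly
  val target = "dbfs:/jars/my-app.jar"          // hypothetical DBFS location
  val exit = Seq("databricks", "fs", "cp", "--overwrite", jar.getAbsolutePath, target).!
  if (exit != 0) sys.error(s"databricks fs cp failed with exit code $exit")
}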

Why doesn't an HDInsight cluster come with Scala pre-installed?

On HDInsight's master node, $ scala -version returns an error. Scala is easily installed via
$ apt-get install scala
but shouldn't Scala be installed there by default?
Thank you for the suggestion. What's the scenario where you need Scala to be installed directly on the node? For example, with Spark there are a couple of other common scenarios that already work:
Running Spark commands on the command line. This is accomplished through spark-shell, which has a built-in Scala interpreter.
Building a Spark project. This is usually done through a Maven or sbt project definition file. Those tools automatically download the correct Scala version and compiler based on the project definition.
As you said, it's not hard to preinstall Scala, but we would like to understand the need for it. In discussions with customers this hasn't come up before.

Classpath resolution between spark uber jar and spark-submit --jars when similar classes exist in both

What is the precedence in class loading when both the uber jar of my Spark application and the contents of the --jars option to my spark-submit command contain similar dependencies?
I ask this from a third-party library integration standpoint. If I set --jars to use a third-party library at version 2.0 and the uber jar passed to this spark-submit script was assembled using version 2.1, which class is loaded at runtime?
At present, I'm thinking of keeping my dependencies on HDFS and adding them to the --jars option on spark-submit, while asking users via end-user documentation to set the scope of this third-party library to 'provided' in their Spark application's Maven POM file.
This is somewhat controlled with params:
spark.driver.userClassPathFirst &
spark.executor.userClassPathFirst
If set to true (default is false), from docs:
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.
I wrote some of the code that controls this, and there were a few bugs in the early releases, but if you're using a recent Spark release it should work (although it is still an experimental feature).
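One way to check what actually won at runtime, assuming a hypothetical contested class com.example.thirdparty.SomeClient, is to ask the classloader where it came from (run it on the driver, and inside a task if you also care about the executors):

// Prints the jar a contested class was loaded from; purely a debugging sketch.
val source = Class.forName("com.example.thirdparty.SomeClient")   // hypothetical class name
  .getProtectionDomain
  .getCodeSource
  .getLocation
println(s"SomeClient loaded from: $source")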

Apache Spark app workflow

How do You organize the Spark development workflow?
My way:
Local Hadoop/YARN service.
Local Spark service.
IntelliJ on one screen
Terminal with a running sbt console
After I change Spark app code, I switch to the terminal and run "package" to compile to a jar and "submitSpark", which is an sbt task that runs spark-submit (a rough sketch of such a task follows this list).
Wait for exception in sbt console :)
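For reference, a rough sketch of what that kind of sbt task can look like (class name, master, and error handling are all placeholders):

// build.sbt (excerpt) -- packages the plain jar, then hands it to spark-submit.
import scala.sys.process._

lazy val submitSpark = taskKey[Unit]("Package the app and run it with spark-submit")

submitSpark := {
  val jar = (Compile / packageBin).value        // the same artifact that "package" builds
  val exit = Seq(
    "spark-submit",
    "--master", "yarn",
    "--class", "com.example.MyApp",             // hypothetical main class
    jar.getAbsolutePath
  ).!
  if (exit != 0) sys.error(s"spark-submit exited with code $exit")
}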
I also tried to work with spark-shell:
Run the shell and load the previously written app.
Write a line in the shell
Evaluate it
If it's fine, copy it to the IDE
After a few rounds of steps 2-4, paste the code into the IDE, compile the Spark app, and start again
Is there any way to develop Spark apps faster?
I develop the core logic of our Spark jobs using an interactive environment for rapid prototyping. We use the Spark Notebook running against a development cluster for that purpose.
Once I've prototyped the logic and it's working as expected, I "industrialize" the code in a Scala project with the classical build lifecycle: create tests; build, package, and create artifacts with Jenkins.
I found that writing scripts and using :load / :copy streamlined things a bit, since I didn't need to package anything. If you do use sbt, I suggest you start it and use ~ package so that it automatically packages the jar when changes are made. Eventually, of course, everything will end up in an application jar; this is for prototyping and exploring.
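For example, a throwaway script kept next to the project and pulled into a running spark-shell with :load (file name, paths, and columns are made up):

// explore.scala -- run inside spark-shell via  :load explore.scala
// spark-shell already defines spark and sc, so no setup boilerplate is needed.
val events = spark.read.json("/tmp/events.json")      // hypothetical input
events.printSchema()
events.groupBy("eventType").count().show()            // hypothetical column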
Local Spark
Vim
Spark-Shell
APIs
Console
We develop our applications using an IDE (IntelliJ, because we write our Spark applications in Scala) and use ScalaTest for testing.
In those tests we use local[*] as the Spark master in order to allow debugging.
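A minimal sketch of such a test (ScalaTest style and names are illustrative):

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Spins up a local SparkSession so the job logic can be run and debugged in the IDE.
class WordCountSpec extends AnyFunSuite {
  test("counts words") {
    val spark = SparkSession.builder()
      .master("local[*]")                  // local master: no cluster required
      .appName("test")
      .getOrCreate()
    import spark.implicits._

    val counts = Seq("a", "b", "a").toDS()
      .groupBy("value").count()
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap

    assert(counts("a") == 2L)
    spark.stop()
  }
}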
For integration testing we use Jenkins, and we launch an "end to end" script as a Scala application.
I hope this is useful.
