How to add a custom-built Breeze library to Apache Spark? - apache-spark

I added some methods to the Breeze library, and I can see those methods through the IDE. I then tried to add this custom-built Breeze to my project, which is based on Apache Spark. However, when I package the project with "sbt assembly" and run it on my cluster, it throws a "no such method xxx" error, which means the cluster didn't actually use my Breeze build. Could anyone tell me how to make the cluster run the Breeze library I built myself?

My guess is that Spark ships with its own version of Breeze and prefers it over the custom jars in your assembly. You can try building Spark against your custom library: install it in your local Maven repository, reference it in Apache Spark's pom.xml, and build your own Spark distribution.
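If rebuilding Spark is too heavy, another commonly used option (a sketch, not specific to this question's exact setup) is to shade your custom Breeze inside the assembly, so its classes can no longer collide with the Breeze copy Spark ships with. With the sbt-assembly plugin this looks roughly like:

```scala
// build.sbt sketch: rename the breeze packages inside the fat jar so they
// no longer clash with Spark's bundled Breeze. The shaded package name
// ("myshaded") is arbitrary; references in your own classes are rewritten too.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("breeze.**" -> "myshaded.breeze.@1").inAll
)
```

Alternatively, the spark.driver.userClassPathFirst and spark.executor.userClassPathFirst settings ask Spark to prefer your jars over its own, though they are marked experimental and can cause other conflicts.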

Related

Unable to build Spark application with multiple main classes for Databricks job

I have a Spark application that contains multiple Spark jobs to be run on Azure Databricks. I want to build and package the application into a fat jar. The application compiles successfully, but when I try to package it (command: sbt package), it gives "[warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list".
How do I build the application jar (without specifying any main class) so that I can upload it to a Databricks job and specify the main class path there?
This message is just a warning (note the [warn] prefix); it doesn't prevent generation of the jar files (normal or fat). You can then upload the resulting jar to DBFS (or ADLS for newer Databricks Runtime versions) and create the Databricks job either as a Jar task or a Spark Submit task.
If sbt fails and doesn't produce jars, then you have some plugin that turns warnings into errors.
Also note that sbt package doesn't produce a fat jar - it produces a jar containing only the classes in your project. You will need sbt assembly (install the sbt-assembly plugin for that) to generate a fat jar, but make sure you mark the Spark & Delta dependencies as provided.
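A minimal build.sbt sketch of the provided-dependency setup (version numbers are illustrative):

```scala
// build.sbt sketch: Spark and Delta marked `provided` so sbt-assembly
// leaves them out of the fat jar — the Databricks runtime supplies them
// on the cluster.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.3.0" % "provided",
  "io.delta"         %% "delta-core" % "2.1.0" % "provided"
)
// No mainClass is set: the jar keeps all discovered main classes, and the
// Databricks Jar task supplies the entry point at job-creation time.
```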

How to run Spark processes in develop environment using a cluster?

I'm implementing different Apache Spark solutions using IntelliJ IDEA, Scala and SBT; however, each time I want to run my implementation I need to do the following steps after creating the jar:
Amazon: send the .jar to the master node via SSH, and then run spark-shell from the command line.
Azure: I'm using the Databricks CLI, so each time I want to upload a jar, I uninstall the old library, remove the jar stored in the cluster, and finally upload and install the new .jar.
So I was wondering if it is possible to do all these processes just in one click, using the IntelliJ IDEA RUN button for example, or using another method to make simpler all of it. Also, I was thinking about Jenkins as an alternative.
Basically, I'm looking for easier deployment options.
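One way to get closer to one-click deployment without extra tooling is a custom sbt task that assembles and uploads in one step. A minimal sketch for the SSH case (the task name, host and path are hypothetical):

```scala
// build.sbt sketch: `sbt deploy` builds the fat jar via sbt-assembly and
// copies it to the cluster's master node with scp.
import scala.sys.process._

lazy val deploy = taskKey[Unit]("Assemble the fat jar and copy it to the cluster")

deploy := {
  val jar = assembly.value  // fat jar produced by sbt-assembly
  s"scp ${jar.getAbsolutePath} user@master-node:/opt/jobs/".!  // hypothetical host/path
}
```

The same idea works for the Databricks case by shelling out to the Databricks CLI instead of scp; IntelliJ can then invoke `sbt deploy` from a Run Configuration, giving you a single button.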

Pass external dependency for building spark-submit Hydrosphere-mist

I'm currently using Spark with Hydrosphere Mist.
My project depends on external libraries (.jar files) and some other packages, such as:
1. somejarfile.jar
2. org.apache.hadoop:hadoop-aws:2.7.4,
3. org.apache.hadoop:hadoop-client:2.7.4
To pass these external dependencies to Mist, I can configure the context's run-options.
To configure an external jar:
mycontext.run-options="--jars somejarfile.jar"
To configure external packages:
mycontext.run-options="--packages org.apache.hadoop:hadoop-aws:2.7.4,org.apache.hadoop:hadoop-client:2.7.4"
Is it possible to configure run-options with both jars and packages, something like the following?
mycontext.run-options="--jars somejarfile.jar, --packages org.apache.hadoop:hadoop-aws:2.7.4,org.apache.hadoop:hadoop-client:2.7.4"
I'm stuck here. Any help with this problem would be highly appreciated.
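Since run-options is normally passed straight through to spark-submit, the usual spark-submit convention should apply: separate the two flags with a space, not a comma. A sketch (not verified against the Mist documentation):

```
mycontext.run-options = "--jars somejarfile.jar --packages org.apache.hadoop:hadoop-aws:2.7.4,org.apache.hadoop:hadoop-client:2.7.4"
```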

SBT console vs Spark-Shell for interactive development

I'm wondering if there are any important differences between using SBT console and Spark-shell for interactively developing new code for a Spark project (notebooks are not really an option w/ the server firewalls).
Both can import project dependencies, but for me SBT is a little more convenient: it automatically brings in all the dependencies in build.sbt, while spark-shell takes the --jars, --packages, and --repositories arguments on the command line.
SBT has the handy initialCommands setting to automatically run lines at startup. I use this for initializing the SparkContext.
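A minimal initialCommands sketch that builds a local session when `sbt console` starts (this assumes the Spark dependency is on the console classpath, i.e. not marked provided for that configuration):

```scala
// build.sbt sketch: these lines run automatically at sbt console startup.
initialCommands in console := """
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder.master("local[*]").appName("console").getOrCreate()
  val sc = spark.sparkContext
  import spark.implicits._
"""
```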
Are there others?
With SBT you theoretically don't need to install Spark itself. (I use Databricks.)
In my experience, sbt resolves external jars automatically, while spark-shell runs a series of imports and creates the contexts automatically. I prefer spark-shell because it follows the standard you need to adhere to when building the spark-submit session.
For running the code in production you need to build the code into jars and call them via spark-submit. To get there, you package it via sbt (compilation check) and run the spark-submit call (logic check).
You can develop using either tool, but you should code as if you did not have the advantages of sbt (resolving the jars) or spark-shell (the pre-made imports and contexts), because spark-submit provides neither.
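For reference, the production flow described above boils down to two commands (the class name, paths and master are illustrative):

```
sbt assembly
spark-submit --class com.example.MyJob --master yarn target/scala-2.12/myapp-assembly-0.1.0.jar
```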

Why does the HDInsight cluster not come with pre-installed Scala?

On HDInsight's master node, $ scala -version returns an error. It is easily installed via
$ apt-get install scala
but shouldn't Scala be installed there by default?
Thank you for the suggestion. What's the scenario where you need Scala installed directly on the node? For Spark, there are a couple of common scenarios that already work:
Running Spark commands interactively. This is accomplished through spark-shell, which has a built-in Scala interpreter.
Building a Spark project. This is usually done through a Maven or sbt project definition file; those tools automatically download the correct Scala version and compiler based on the project dependencies.
As you said, it's not hard to preinstall Scala, but we would like to understand the need for it. It hasn't come up in discussions with customers before.
