Why does HDInsight cluster does not come with pre-installed Scala? - azure

on HDInsight's masternode, $scala -verion returns an error. It is easily installed via
$apt-get install scala
but shouldn't scala be installed there by default?

Thank you for suggestion. What's the scenario where you need scala to be directly installed on the node? For example, in spark there are couple of other common scenarios that already work:
Running Spark commands in command line. This is accomplished through spark-shell which has built-in scala interpreter.
Building spark project. This is ussually done through maven or sbt project definition file. Those tools would automatically download correct scala version and compiler based on the project dependencies.
As you said it's not hard to preinstall scala, but we would like to understand the need to do that. In the discussions with customers this didn't come up before.

Related

How to run Spark processes in develop environment using a cluster?

I'm implementing differents Apache Spark solutions using IntelliJ IDEA, Scala and SBT, however, each time that I want to run my implementation I need to do the next steps after creating the jar:
Amazon: To send the .jar to the master node using SSH, and then run
the command line spark-shell.
Azure: I'm using Databricks CLI, so each time that I want to upload a
jar, I uninstall the old library, remove the jar stored in the cluster,
and finally, I upload and install the new .jar.
So I was wondering if it is possible to do all these processes just in one click, using the IntelliJ IDEA RUN button for example, or using another method to make simpler all of it. Also, I was thinking about Jenkins as an alternative.
Basically, I'm looking for easier deployment options.

SBT console vs Spark-Shell for interactive development

I'm wondering if there are any important differences between using SBT console and Spark-shell for interactively developing new code for a Spark project (notebooks are not really an option w/ the server firewalls).
Both can import project dependencies, but for me SBT is a little more convenient. SBT automatically brings in all the dependencies in build.sbt and spark-shell can use the --jar, --packages, and --repositories arguments in the command line.
SBT has the handy initialCommands setting to automatically run lines at startup. I use this for initializing the SparkContext.
Are there others?
With SBT you need not install SPARK itself theoretically.
I use databricks.
From my experience sbt calls external jars innately spark shell calls series of imports and contexts innately. I prefer spark shell because it follows the standard you need to adhere to when build the spark submit session.
For running the code in production you need to build the code into jars, calling them via spark submit. To build that you need to package it via sbt (compilation check) and run the spark submit submit call (logic check).
You can develope using either tool but you should code as if you did not have the advantages of sbt (calling the jars) and spark shell (calling the imports and contexts) because spark submit doesn't do either.

Yahoo - caffe on Spark library dependency?

Yahoo just released a version of caffe that uses the latest version of Apache-Spark yesterday, the git repo is not well documented yet: git link
There is a scala test file which is suppose to run an example: Scala Example
but it requires the dependency com.yahoo.ml.caffe.{Config, CaffeOnSpark, DataSource} which I assume contains basically the data, the config and the API. Has this been made into a library yet? How could I build this using sbt?
Our CaffeOnSpark release contains all the code that you need to run. com.yahoo.ml.caffe.* are collection of Scala classes in caffe-grid folder. Please follow our guides at CaffeOnSpark wiki page, and ask questions at your mailing list.

Apache Spark app workflow

How do You organize the Spark development workflow?
My way:
Local hadoop/yarn service.
Local spark service.
Intellij on one screen
Terminal with running sbt console
After I change Spark app code, I switch to terminal and run "package" to compile to jar and "submitSpark" which is stb task that runs spark-submit
Wait for exception in sbt console :)
I also tried to work with spark-shell:
Run shell and load previously written app.
Write line in shell
Evaluate it
If it's fine copy to IDE
After few 2,3,4, paste code to IDE, compile spark app and start again
Is there any way to develop Spark apps faster?
I develop the core logic of our Spark jobs using an interactive environment for rapid prototyping. We use the Spark Notebook running against a development cluster for that purpose.
Once I've prototyped the logic and it's working as expected, I "industrialize" the code in a Scala project, with the classical build lifecycle: create tests; build, package and create artifacts by Jenkins.
I found writing scripts and using :load / :copy streamlined things a bit since I didn't need to package anything. If you do use sbt I suggest you start it and use ~ package such that it automatically packages the jar when changes are made. Eventually of course everything will end up in an application jar, this is for prototyping and exploring.
Local Spark
Vim
Spark-Shell
APIs
Console
We develop ours applications using an IDE (Intellij because we code your spark's applications in Scala) using scalaTest for testing.
In those tests we use local[*] as SparkMaster in order to allow the debugging.
For integration testing we used Jenkins and we launch an "end to end" script as an Scala application.
I hope this will be useful

How to add the Breeze which is build by myself to Apache Spark?

I added some methods to the Breeze library and I can see those methods through IDE. And I was trying to add the Breeze library which is build by myself to my project which is based on Apache Spark. However, when I package my project by command "sbt assembly" and run it on my cluster, it throws an error "no such method xxx" which means the cluster actually didn't run my Breeze library. So could anyone tell me how to make the cluster run the Breeze library which is build by myself?
I have a guess that spark uses some version of Breeze libraries itself and prefer them over you custom .jars in assembly. You can try to build spark with your custom library. Install your library in your local maven repository, specify it in apache spark's pom.xml and build your own spark version.

Resources