How can spark-shell work without installing Scala beforehand? - apache-spark

I have downloaded Spark 1.2.0 (pre-built for Hadoop 2.4). In its quick start doc, it says:
It is available in either Scala or Python.
What confuses me is that Scala was never installed separately on my computer (OS X 10.10), yet when I type spark-shell it runs fine, and the output shows:
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
I never installed any Scala distribution before.
How can spark-shell run without Scala?

tl;dr Scala binaries are already included in Spark (to make Spark users' lives easier).
Under Downloading in Spark Overview you can read about what is required to run Spark:
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS).
It’s easy to run locally on one machine — all you need is to have java
installed on your system PATH, or the JAVA_HOME environment variable
pointing to a Java installation.
Spark runs on Java 6+ and Python 2.6+. For the Scala API, Spark 1.2.0
uses Scala 2.10. You will need to use a compatible Scala version
(2.10.x).

Scala programs, including spark-shell, are compiled to Java bytecode, which can be run by any Java Virtual Machine (JVM). Therefore, as long as you have a JVM installed, i.e. the java command, you can run the Spark tools written in Scala.
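If you want to see this for yourself, the bundled versions can be printed from inside spark-shell. A quick check (the exact values depend on your distribution):

// Run inside spark-shell; no separate Scala installation is involved.
sc.version                            // Spark version, e.g. 1.2.0
util.Properties.versionString         // Scala version bundled with Spark, e.g. "version 2.10.4"
System.getProperty("java.version")    // the JVM that actually runs everything, e.g. 1.8.0_25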

Related

Which Scala version does Spark 2.4.3 use?

I installed Scala (version 2.12.8) and Spark (2.4.3) on macOS via Homebrew. I already have Java 1.8 installed on my machine.
When I launch spark-shell, I see the logo says:
Spark version 2.4.3, Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Why does it say Scala version 2.11.12 instead of the Scala 2.12.8 that is installed on my machine?
Does Spark 2.4.3 come with Scala 2.11.12?
Thanks.
As stated in the release notes:
Spark 2.4.3 is a maintenance release containing stability fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release.
Note that 2.4.3 switched the default Scala version from Scala 2.12 to Scala 2.11, which is the default for all the previous 2.x releases except 2.4.2. That means, the pre-built convenience binaries are compiled for Scala 2.11. Spark is still cross-published for 2.11 and 2.12 in Maven Central, and can be built for 2.12 from source.
Additionally, the Scala version you happen to have on your machine is completely irrelevant; Spark uses the Scala version it was compiled with.
Once we start writing Spark code, we need to add spark-core and spark-sql as dependencies in the project. If the right versions are not used, compilation or runtime fails with missing definitions.
To choose the right versions of the Spark and Scala libraries:
Check the installed Spark version by running spark-shell; its banner shows both the Spark and Scala versions. Use exactly those versions when declaring the dependencies in your project.
For example, in sbt (Spark 2.4.5 is built against Scala 2.11.12):
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
scalaVersion := "2.11.12"
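Put together, a minimal build.sbt for such a project might look like the sketch below. The project name is a placeholder, and the provided scope is the usual choice when the application is submitted to a cluster that already ships Spark; drop it for local runs.

// Minimal build.sbt sketch, assuming the pre-built Spark 2.4.5 / Scala 2.11.12 combination shown by spark-shell
name := "spark-example"    // placeholder project name
scalaVersion := "2.11.12"  // must match the Scala version Spark was compiled with
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.5" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.5" % "provided"
)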

Can PySpark work without Spark?

I have installed PySpark standalone/locally (on Windows) using
pip install pyspark
I was a bit surprised that I can already run pyspark on the command line or use it in Jupyter notebooks, and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c).
Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain:
what is the exact connection between these two technologies?
why is installing PySpark enough to make it run? Does it actually install Spark under the hood? If yes, where?
if you install only PySpark, is there something you miss (e.g. I cannot find the sbin folder, which contains e.g. the script to start the history server)?
As of v2.2, executing pip install pyspark will install Spark.
If you're going to use PySpark, it's clearly the simplest way to get started.
On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars
The PySpark installed by pip is a subset of the full Spark distribution; you can find most of the PySpark Python files under spark-3.0.0-bin-hadoop3.2/python/pyspark. So if you'd like to use the Java or Scala interfaces, or deploy a distributed system with Hadoop, you must download the full Spark distribution from the Apache Spark site and install it.
PySpark ships with a Spark installation of its own. If installed through pip3, you can find it with pip3 show pyspark; for example, for me it is at ~/.local/lib/python3.8/site-packages/pyspark.
This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.

java.lang.NoSuchMethodError when loading external FAT-JARs in Zeppelin

While trying to run a piece of code that uses some fat JARs (which share some common submodules) built with sbt assembly, I'm running into this nasty java.lang.NoSuchMethodError.
The JARs are built on EMR itself (and not uploaded from some other environment), so a version conflict in libraries / Spark / Scala etc. is unlikely.
My EMR environment:
Release label: emr-5.11.0
Hadoop distribution: Amazon 2.7.3
Applications: Spark 2.2.1, Zeppelin 0.7.3, Ganglia 3.7.2, Hive 2.3.2, Livy 0.4.0, Sqoop 1.4.6, Presto 0.187
Project configurations:
Scala 2.11.11
Spark 2.2.1
SBT 1.0.3
It turned out that the real culprits were the shared submodules in those JARs.
Two fat jars built out of projects containing common submodules were leading to this conflict. Removing one of those jars resolved the issue.
I'm not sure if this conflict happens only under some particular circumstances or would always occur upon loading such JARs (that share the same submodules) in the Zeppelin interpreter, so I'm still waiting for a proper explanation.
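Until a proper explanation turns up, one way to narrow such a conflict down is to ask the JVM which JAR a suspect class was actually loaded from. A small diagnostic sketch that can be pasted into the Zeppelin Spark interpreter (the class name is a hypothetical stand-in for a class from one of the shared submodules):

// Print the JAR a class was loaded from, to see which of the duplicate submodules wins.
// Replace com.example.shared.Util with a class from one of your shared submodules.
val clazz = Class.forName("com.example.shared.Util")
println(clazz.getProtectionDomain.getCodeSource.getLocation)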

How to install Spark 2.1.0 on Windows 7 64-bit?

I'm on Windows 7 64-bit and am following this blog to install Spark 2.1.0.
So I tried to build Spark from the sources that I'd cloned from https://github.com/apache/spark to C:\spark-2.1.0.
When I run sbt assembly or sbt -J-Xms2048m -J-Xmx2048m assembly, I get:
[info] Loading project definition from C:\spark-2.1.0\project
[info] Compiling 3 Scala sources to C:\spark-2.1.0\project\target\scala-2.10\sbt-0.13\classes...
java.lang.StackOverflowError
at java.security.AccessController.doPrivileged(Native Method)
at java.io.PrintWriter.<init>(Unknown Source)
at java.io.PrintWriter.<init>(Unknown Source)
at scala.reflect.api.Printers$class.render(Printers.scala:168)
at scala.reflect.api.Universe.render(Universe.scala:59)
at scala.reflect.api.Printers$class.show(Printers.scala:190)
at scala.reflect.api.Universe.show(Universe.scala:59)
at scala.reflect.api.Printers$class.treeToString(Printers.scala:182)
...
I adapted the memory settings of sbt as suggested, which are ignored anyway. Any ideas?
The linked blog post was "Posted on April 29, 2015", which is 2 years old now, and should only be read to learn how things have changed since then (I'm not even going to link the blog post, to stop directing people to the site).
The 2017 way of installing Spark on Windows is as follows:
Download Spark from http://spark.apache.org/downloads.html.
Read the official documentation starting from Downloading.
That's it.
Installing Spark on Windows
Windows is known to give you problems due to Hadoop's requirements (and Spark does use the Hadoop API under the covers).
You'll have to install the winutils binary, which you can find in the https://github.com/steveloughran/winutils repository.
TIP: You should select the version of Hadoop the Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2.1.0.
Save the winutils.exe binary to a directory of your choice, e.g. c:\hadoop\bin, and set HADOOP_HOME to c:\hadoop.
See Running Spark Applications on Windows for further steps.
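As a sanity check before launching spark-shell, the sketch below (just an assumption about where winutils.exe was saved in the previous step, not part of the official setup) verifies that HADOOP_HOME is set and winutils.exe is where Spark expects it:

// Verify the Windows prerequisites described above: HADOOP_HOME set and winutils.exe present in its bin directory.
import java.nio.file.{Files, Paths}

object CheckWinutils extends App {
  val hadoopHome = sys.env.getOrElse("HADOOP_HOME", sys.error("HADOOP_HOME is not set"))
  val winutils   = Paths.get(hadoopHome, "bin", "winutils.exe")
  if (Files.exists(winutils)) println(s"Found $winutils")
  else println(s"winutils.exe not found at $winutils; download it from the steveloughran/winutils repository")
}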
The following settings worked for me (sbtconfig.txt):
# Set the java args to high
-Xmx1024M
-XX:MaxPermSize=2048m
-Xss2M
-XX:ReservedCodeCacheSize=128m
# Set the extra SBT options
-Dsbt.log.format=true

Running R on amazon EMR with spark 1.6 and Zeppelin 0.5.6

I am trying to set up the R interpreter to run in Zeppelin, which is currently running on EMR. Zeppelin is working perfectly and I am able to write scripts in Scala and Python. When I use %r, %sparkR or %knitr I receive an error: "r interpreter not found".
The applications which I have running in my emr-4.7.2 cluster are: Hive 1.0.0, Zeppelin-Sandbox 0.5.6, Spark 1.6.2, Pig 0.14.0
Within the interpreter settings there is no mention of R, so I figure I am missing something but do not know what.
Any pointers greatly appreciated.
Zeppelin on Amazon EMR (up to at least emr-5.0.0) does not support the SparkR interpreter.
You ought to follow the Elastic MapReduce Release Guide / Zeppelin documentation for more information.
