Which Scala version does Spark 2.4.3 use? - apache-spark

I installed Scala (version 2.12.8) and Spark (2.4.3) on my Mac from Homebrew. I already have Java 1.8 installed on my machine.
When I launch spark-shell, the banner says:
Spark version 2.4.3, Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Why does it say Scala version 2.11.12 instead of the Scala 2.12.8 that is installed on my machine?
Does Spark 2.4.3 come with Scala 2.11.12?
Thanks.

As stated in the release notes:
Spark 2.4.3 is a maintenance release containing stability fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release.
Note that 2.4.3 switched the default Scala version from Scala 2.12 to Scala 2.11, which is the default for all the previous 2.x releases except 2.4.2. That means, the pre-built convenience binaries are compiled for Scala 2.11. Spark is still cross-published for 2.11 and 2.12 in Maven Central, and can be built for 2.12 from source.
Additionally, the Scala version you happen to have on your machine is completely irrelevant: Spark uses the Scala version it was compiled with.
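A quick way to confirm this (a minimal check, assuming a running spark-shell; the output shown is illustrative) is to ask the REPL itself, which reports the Scala version bundled with Spark rather than the one installed by Homebrew:
scala> spark.version
res0: String = 2.4.3
scala> util.Properties.versionString
res1: String = version 2.11.12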

Once you start writing Spark code, you need to add spark-core and spark-sql as dependencies in your project. If the right versions are not used, compilation or execution fails with missing definitions.
To choose the right versions of the Spark and Scala libraries:
Check the installed Spark version by running spark-shell; the banner shows both the Spark and Scala versions. Use exactly those versions when declaring dependencies in your projects.
For example, in SBT (Spark 2.4.5 is built for Scala 2.11.12):
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
scalaVersion := "2.11.12"

Related

Are Spark 3.1.2 & Spark 3.2.1 backward compatible?

I want to upgrade Spark 3.1.2 to the latest version, which is now 3.2.1.
I read the release notes and the changes don't seem very significant.
Has anybody encountered issues during the same upgrade?
The Hadoop version I use is 3.1.3, and the Java version is 1.8.

Load XML file to dataframe in PySpark using DBR 7.3.x+

I'm trying to load an XML file into a dataframe using PySpark in a Databricks notebook.
df = spark.read.format("xml").options(
    rowTag="product", mode="PERMISSIVE", columnNameOfCorruptRecord="error_record"
).load(filePath)
On doing so, I get the following error:
Could not initialize class com.databricks.spark.xml.util.PermissiveMode$
Databricks runtime version: 7.3 LTS, Spark version: 3.0.1, Scala version: 2.12
The same code block runs perfectly fine on DBR 6.4 (Spark 2.4.5, Scala 2.11).
You need to upgrade the spark-xml library to a version compiled for Scala 2.12, because the version that works for DBR 6.4 isn't compatible with the new Scala version. So instead of spark-xml_2.11 you need to use spark-xml_2.12.
P.S. I just checked with DBR 7.3 LTS & com.databricks:spark-xml_2.12:0.11.0 - works just fine.
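For reference, if the library is pulled in through a build file rather than the Databricks library UI, the dependency would look roughly like this in SBT (a sketch assuming Scala 2.12 and the 0.11.0 release mentioned above; %% selects the _2.12 artifact automatically):
scalaVersion := "2.12.10"
libraryDependencies += "com.databricks" %% "spark-xml" % "0.11.0"
// equivalent to the explicit coordinate com.databricks:spark-xml_2.12:0.11.0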

Which Spark version should I download to run on top of Hadoop 3.1.2?

On the Spark download page we can choose between releases 3.0.0-preview and 2.4.4.
For release 3.0.0-preview, the package types are:
Pre-built for Apache Hadoop 2.7
Pre-built for Apache Hadoop 3.2 and later
Pre-built with user-provided Apache Hadoop
Source code
For release 2.4.4, the package types are:
Pre-built for Apache Hadoop 2.7
Pre-built for Apache Hadoop 2.6
Pre-built with user-provided Apache Hadoop
Pre-built with Scala 2.12 and user-provided Apache Hadoop
Source code
Since there isn't a Pre-built for Apache Hadoop 3.1.2 option, can I download a Pre-built with user-provided Apache Hadoop package or should I download Source code?
If you are comfortable building source code, then that is your best option.
Otherwise, since you already have a Hadoop cluster, pick "user-provided" and copy your relevant core-site.xml, hive-site.xml, yarn-site.xml, and hdfs-site.xml into $SPARK_CONF_DIR, and it will hopefully mostly work.
Note: DataFrames don't work on Hadoop 3 until Spark 3.x - SPARK-18673
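If you want to double-check which Hadoop your Spark installation is actually picking up, a small sketch (assuming spark-shell starts against your cluster) is to query Hadoop's own version info from inside the shell:
scala> org.apache.hadoop.util.VersionInfo.getVersion
// prints the version of the Hadoop jars on the classpath,
// e.g. 3.1.2 when the user-provided Hadoop 3.1.2 is being used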

java.lang.NoSuchMethodError when loading external FAT-JARs in Zeppelin

While trying to run a piece of code that uses some fat JARs (which share some common submodules) built using sbt assembly, I'm running into this nasty java.lang.NoSuchMethodError.
The JARs are built on EMR itself (not uploaded from some other environment), so a version conflict in libraries / Spark / Scala etc. is unlikely.
My EMR environment:
Release label: emr-5.11.0
Hadoop distribution: Amazon 2.7.3
Applications: Spark 2.2.1, Zeppelin 0.7.3, Ganglia 3.7.2, Hive 2.3.2, Livy 0.4.0, Sqoop 1.4.6, Presto 0.187
Project configurations:
Scala 2.11.11
Spark 2.2.1
SBT 1.0.3
It turned out that the real culprits were the shared submodules in those JARs.
Two fat JARs built from projects containing common submodules were leading to this conflict. Removing one of those JARs resolved the issue.
I'm not sure whether this conflict happens only under particular circumstances or would always occur when loading such JARs (that have the same submodules) into the Zeppelin interpreter, so I'm still waiting for a proper explanation.
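For context, here is a minimal sbt-assembly layout (project names are hypothetical, not taken from the question) in which two fat JARs each bundle the same shared submodule; loading both assemblies into one Zeppelin interpreter then puts two copies of the common classes on the classpath, which is the kind of situation where a NoSuchMethodError can surface if the copies differ:
// build.sbt (hypothetical multi-project build, sbt-assembly plugin enabled)
lazy val common = (project in file("common"))
  .settings(scalaVersion := "2.11.11")
lazy val jobA = (project in file("job-a"))
  .dependsOn(common)
  .settings(
    scalaVersion := "2.11.11",
    // Spark is provided by the EMR cluster, so it is not bundled into the assembly
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1" % "provided"
  )
lazy val jobB = (project in file("job-b"))
  .dependsOn(common)
  .settings(
    scalaVersion := "2.11.11",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1" % "provided"
  )
// `sbt jobA/assembly` and `sbt jobB/assembly` each produce a fat JAR
// containing its own copy of the classes from `common`.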

How can spark-shell work without installing Scala beforehand?

I have downloaded Spark 1.2.0 (pre-built for Hadoop 2.4). In its quick start doc, it says:
It is available in either Scala or Python.
What confuses me is that my computer (OS X 10.10) doesn't have Scala installed separately, but when I type spark-shell it works well, and the output shows:
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
I didn't install any Scala distribution before.
How can spark-shell run without Scala?
tl;dr Scala binaries are included in Spark already (to make Spark users' lives easier).
Under Downloading in Spark Overview you can read about what is required to run Spark:
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
Spark runs on Java 6+ and Python 2.6+. For the Scala API, Spark 1.2.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).
A Scala program, including spark-shell, is compiled to Java bytecode, which can be run by the Java Virtual Machine (JVM). Therefore, as long as you have a JVM installed, meaning the java command, you can run Spark-related tools written in Scala.
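To see this for yourself (an illustrative check, assuming a pre-built Spark distribution; the exact jar shown differs between distributions), you can ask the JVM inside spark-shell where a core Scala class was loaded from:
scala> scala.Predef.getClass.getProtectionDomain.getCodeSource.getLocation
// points at a jar shipped inside the Spark distribution (for a pre-built Spark 1.2.0,
// typically the Spark assembly jar), not at a separately installed Scala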
