Are Spark 3.1.2 & Spark 3.2.1 backward compatible? - apache-spark

I want to upgrade from Spark 3.1.2 to the latest version, which is now 3.2.1.
I read the release notes, and the changes don't seem very significant.
Has anybody run into issues doing the same upgrade?
The Hadoop version I use is 3.1.3, and the Java version is 1.8.
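Not an answer in itself, but one low-risk way to find out is to bump the version in a scratch build and rerun the job's tests before touching the cluster. A minimal SBT sketch under that assumption (the "provided" scope and the Scala patch version are illustrative, not taken from the question):

// build.sbt - hypothetical scratch build used only to compile and test an existing job
// against Spark 3.2.1 before upgrading the cluster.
// Spark 3.2.x still publishes Scala 2.12 artifacts, so the artifact suffix is unchanged
// coming from 3.1.2; only the version number moves.
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.2.1" % "provided"
)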

Related

Load XML file to dataframe in PySpark using DBR 7.3.x+

I'm trying to load an XML file into a dataframe using PySpark in a Databricks notebook.
df = spark.read.format("xml").options(
    rowTag="product", mode="PERMISSIVE", columnNameOfCorruptRecord="error_record"
).load(filePath)
On doing so, I get the following error:
Could not initialize class com.databricks.spark.xml.util.PermissiveMode$
Databricks runtime version: 7.3 LTS, Spark version: 3.0.1, Scala version: 2.12
The same code block runs perfectly fine on DBR 6.4 (Spark 2.4.5, Scala 2.11).
You need to upgrade the spark-xml library to a version compiled for Scala 2.12, because the version that works for DBR 6.4 isn't compatible with the new Scala version. So, instead of spark-xml_2.11 you need to use spark-xml_2.12.
P.S. I just checked with DBR 7.3 LTS & com.databricks:spark-xml_2.12:0.11.0 - works just fine.
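For reference, a minimal Scala sketch of the same read with the Scala 2.12 build of the library attached (the input path is a placeholder; the option names mirror the PySpark snippet above):

// Assumes the cluster has com.databricks:spark-xml_2.12:0.11.0 attached and that this
// runs in a notebook or spark-shell where `spark` (a SparkSession) already exists.
val df = spark.read
  .format("xml")
  .option("rowTag", "product")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "error_record")
  .load("/path/to/products.xml")  // placeholder path

df.printSchema()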

Spark for Java 11

Spark 2.x, when used with Java 11, gives the error below:
Exception in thread "main" java.lang.IllegalArgumentException: Unsupported class file major version 55
Does Spark 3.0 have compatibility with Java 11?
Is there any other workaround to use Java 11 with Spark?
From the Spark 3 documentation:
Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+. Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0. Python 2 and Python 3 prior to version 3.6 support is deprecated as of Spark 3.0.0. R prior to version 3.4 support is deprecated as of Spark 3.0.0. For the Scala API, Spark 3.0.0 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
So Spark 3.x works with Java 11; Spark 2.x does not, as pointed out in this answer.
For Java 11 there is a spark-streaming_2.12 artifact at version 3.1.1.
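A minimal SBT sketch of a Spark 3 project meant to build and run on a Java 11 JDK (the Scala patch version and the "provided" scope are assumptions, not from the answer):

// build.sbt - hypothetical Spark 3.1.1 project run on Java 11.
// Spark 3.x artifacts are published for Scala 2.12, so %% resolves the _2.12 suffix.
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"       % "3.1.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "3.1.1" % "provided"  // the artifact mentioned above
)
// With Spark 2.x the same setup fails at runtime with
// "Unsupported class file major version 55" (55 = Java 11 class files).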

Which Scala version does Spark 2.4.3 use?

I installed Scala (version 2.12.8) and Spark (2.4.3) on my Mac via Homebrew. I already have Java 1.8 installed on my machine.
When I launch spark-shell, the logo says:
Spark version 2.4.3, Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Why does it say Scala version 2.11.12 instead of the Scala 2.12.8 that is installed on my machine?
Does Spark 2.4.3 come with Scala 2.11.12?
Thanks.
As stated in the release notes:
Spark 2.4.3 is a maintenance release containing stability fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release.
Note that 2.4.3 switched the default Scala version from Scala 2.12 to Scala 2.11, which is the default for all the previous 2.x releases except 2.4.2. That means, the pre-built convenience binaries are compiled for Scala 2.11. Spark is still cross-published for 2.11 and 2.12 in Maven Central, and can be built for 2.12 from source.
Additionally, the Scala version you happen to have on your machine is completely irrelevant: Spark uses the Scala version it was compiled with.
Once we start writing Spark code, we need to import spark-core and spark-sql into the project. If the right versions are not used, compilation or runtime fails with missing definitions.
To choose the right versions of the Spark and Scala libraries:
Check the installed Spark version by running spark-shell; it shows both the Spark and Scala versions. Use exactly those versions when importing them into your project.
For example in SBT: Spark 2.4.5 is built against Scala 2.11.12:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
scalaVersion := "2.11.12"
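As a sanity check, the same information is visible from inside spark-shell, so you can confirm that your build file matches the running installation. A small sketch:

// Run inside spark-shell; `spark` is the SparkSession the shell creates for you.
println(s"Spark: ${spark.version}")                        // e.g. 2.4.3
println(s"Scala: ${scala.util.Properties.versionString}")  // e.g. version 2.11.12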

spark-cassandra connector issue

I am using Spark 1.6.2 with Scala version 2.10.5.
I have now installed Cassandra locally and downloaded spark-cassandra-connector_2.10-1.6.2.jar from https://spark-packages.org/package/datastax/spark-cassandra-connector
But when I try to fire up the spark shell against Cassandra using the connector, I am getting an error.
Can someone please tell me whether I am downloading the wrong version of the connector, or whether there is some other issue?
Just put : between spark-cassandra-connector and 1.6.2 instead of _, and remove the ; character after the connector version:
spark-shell --packages datastax:spark-cassandra-connector:1.6.2-s_2.10
But it's better to use the latest 1.6.x release: 1.6.11 instead of 1.6.2.
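Once the shell starts with the package resolved, a minimal read sketch against the local node (the keyspace and table names are placeholders, not from the question):

// Spark 1.6.x shell: `sc` and `sqlContext` are predefined.
// Point the connector at the locally installed Cassandra node.
sqlContext.setConf("spark.cassandra.connection.host", "127.0.0.1")

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // placeholder names
  .load()

df.show()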

How to safely remove Spark 2.2.0 and install Spark 2.1.0 instead?

I have recently installed Spark 2.2.0 and started configuring it, but unfortunately the sparklyr library does not support 2.2.0 yet. What is the safest way to swap it for 2.1.0?
