Spark for Java 11

Spark 2.x, when used with Java 11, gives the error below:
Exception in thread "main" java.lang.IllegalArgumentException: Unsupported class file major version 55
Does Spark 3.0 have compatibility with Java 11?
Is there any other workaround to use Java 11 with Spark?

From Spark 3 documentation:
Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+. Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0. Python 2 and Python 3 prior to version 3.6 support is deprecated as of Spark 3.0.0. R prior to version 3.4 support is deprecated as of Spark 3.0.0. For the Scala API, Spark 3.0.0 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
So Spark 3.x works with Java 11; Spark 2.x does not, as pointed out in this answer.

For Java 11, the spark-streaming_2.12 artifact is available in version 3.1.1.
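As an illustration, here is a minimal build.sbt sketch that pulls Spark 3.x artifacts built for Scala 2.12 into a project running on Java 11. Only spark-streaming_2.12 3.1.1 comes from the answer above; the spark-core/spark-sql entries and the Scala patch version (2.12.10) are illustrative assumptions.

// build.sbt (sketch): Spark 3.1.1 artifacts compiled for Scala 2.12, usable on Java 11
// Scala 2.12.10 and the spark-core/spark-sql entries are illustrative assumptions.
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "3.1.1",
  "org.apache.spark" %% "spark-sql"       % "3.1.1",
  "org.apache.spark" %% "spark-streaming" % "3.1.1"
)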

Related

Are Spark 3.1.2 & Spark 3.2.1 backward compatible?

I want to upgrade Spark 3.1.2 to the latest version, which now is 3.2.1.
I read the release notes and it seems like the changes are not so significant.
Has anybody encountered any issues during the same update?
The Hadoop version I use is 3.1.3; the Java version is 1.8.

Compatible version of Scala for Spark 2.4.2 & EMR 5.24.1

What Scala version should I use to compile/build with Spark 2.4.2? I tried with Scala 2.12 and got the below error message.
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
Any input is really appreciated.
Upon checking, 2.11.12 looks to be the Scala version on EMR.
It worked for us.
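For illustration, a minimal build.sbt sketch matching the combination reported to work above (Spark 2.4.2 with Scala 2.11.12 on EMR 5.24.1). Marking the Spark dependencies as "provided" is an assumption, based on the EMR cluster supplying Spark at runtime.

// build.sbt (sketch): build against the Scala version the cluster's Spark 2.4.2 was compiled with
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // "provided" is an assumption: EMR already ships these jars on the cluster
  "org.apache.spark" %% "spark-core" % "2.4.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.2" % "provided"
)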

Load XML file to dataframe in PySpark using DBR 7.3.x+

I'm trying to load an XML file into a dataframe using PySpark in a Databricks notebook.
df = spark.read.format("xml").options(
rowTag="product" , mode="PERMISSIVE", columnNameOfCorruptRecord="error_record"
).load(filePath)
On doing so, I get the following error:
Could not initialize class com.databricks.spark.xml.util.PermissiveMode$
Databricks runtime version: 7.3 LTS, Spark version: 3.0.1, Scala version: 2.12.
The same code block runs perfectly fine in DBR 6.4 (Spark 2.4.5, Scala 2.11).
You need to upgrade the spark-xml library to a version compiled for Scala 2.12, because the version that works for DBR 6.4 isn't compatible with the new Scala version. So, instead of spark-xml_2.11 you need to use spark-xml_2.12.
P.S. I just checked with DBR 7.3 LTS & com.databricks:spark-xml_2.12:0.11.0 - works just fine.
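For reference, the same coordinate written as an SBT dependency. This is just a sketch of the artifact naming; on Databricks you would normally attach the library to the cluster via the Maven coordinate above rather than declare it in a build file.

// build.sbt (sketch): spark-xml compiled for Scala 2.12, matching DBR 7.3 LTS / Spark 3.0.1
libraryDependencies += "com.databricks" % "spark-xml_2.12" % "0.11.0"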

PySpark compatible hadoop-aws and aws-java-sdk for version 2.4.4

I am trying to read and write from S3 buckets using PySpark with the help of these two libraries from Maven, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.7 and https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4, which are really old. I tried different combinations of hadoop-aws and aws-java-sdk, but it's not working with PySpark version 2.4.4. Does anyone know which versions of hadoop-aws and aws-java-sdk are compatible with Spark version 2.4.4?
I am using the following:
Spark: 2.4.4
Hadoop: 2.7.3
Hadoop-AWS: hadoop-aws-2.7.3.jar
AWS-JAVA-SDK: aws-java-sdk-1.7.3.jar
Scala: 2.11
This works for me; use s3a://bucket-name/ paths.
Note: for PySpark I used aws-java-sdk-1.7.4.jar, because otherwise I wasn't able to call:
df.write.csv(path=path, mode="overwrite", compression="None")
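For illustration, the working combination above expressed as SBT coordinates. This is a sketch only; with PySpark you would typically pass the same hadoop-aws/aws-java-sdk coordinates to spark-submit via --packages (or put the jars on the classpath) instead of building a JVM project, and "provided" is an assumption that the cluster ships Spark itself.

// build.sbt (sketch): the connector versions paired with Spark 2.4.4 / Hadoop 2.7.3 above
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"    % "2.4.4" % "provided",  // assumption: Spark provided by the cluster
  "org.apache.hadoop" %  "hadoop-aws"   % "2.7.3",
  "com.amazonaws"     %  "aws-java-sdk" % "1.7.4"
)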

Which Scala version does Spark 2.4.3 use?

I installed Scala (version 2.12.8) and Spark (2.4.3) on my Mac from Homebrew. I already have Java 1.8 installed on my machine.
When I launch spark-shell, I see the banner says:
Spark version 2.4.3, Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Why does it say Scala version 2.11.12 instead of Scala 2.12.8, which is installed on my machine?
Does Spark 2.4.3 come with Scala 2.11.12?
Thanks.
As stated in the release notes:
Spark 2.4.3 is a maintenance release containing stability fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release.
Note that 2.4.3 switched the default Scala version from Scala 2.12 to Scala 2.11, which is the default for all the previous 2.x releases except 2.4.2. That means, the pre-built convenience binaries are compiled for Scala 2.11. Spark is still cross-published for 2.11 and 2.12 in Maven Central, and can be built for 2.12 from source.
Additionally, the Scala version you happen to have on your machine is completely irrelevant: Spark uses the Scala version it was compiled with.
Once we start writing Spark code, we need to import spark-core and spark-sql in the project. If the right versions are not used, compilation or runtime fails with missing definitions.
To choose the right versions of the Spark and Scala libraries:
Check the installed Spark version by running spark-shell; it shows both the Spark and Scala versions. Use exactly those versions when declaring dependencies in your project.
For example, in SBT (Spark 2.4.5 is built against Scala 2.11.12):
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
scalaVersion := "2.11.12"

Resources