error while writing Dataframe into HDFS path [duplicate] - apache-spark

I'm starting to test Spark.
I installed Spark on my local machine and ran a local cluster with a single worker. When I tried to execute my job from my IDE, setting the SparkConf as follows:
final SparkConf conf = new SparkConf().setAppName("testSparkfromJava").setMaster("spark://XXXXXXXXXX:7077");
final JavaSparkContext sc = new JavaSparkContext(conf);
final JavaRDD<String> distFile = sc.textFile(Paths.get("").toAbsolutePath().toString() + "dataSpark/datastores.json");
I got this exception:
java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -5447855329526097695, local class serialVersionUID = -2221986757032131007

There can be multiple sources of incompatibility, including:
Hadoop version;
Spark version;
Scala version;
...
For me it was the Scala version: I was using 2.11.x in my IDE, but the official doc says:
Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).
and that x cannot be smaller than 3 if you are using the latest Java (1.8), which is what causes this error.
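As a quick sanity check, here is a minimal sketch (my own illustration, not from this answer; the class name and the local[*] master are chosen just for the example) that prints the Spark and Scala versions your driver classpath is actually linked against, so you can compare them with the output of bin/spark-submit --version on the installed cluster:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class VersionCheck {
    public static void main(String[] args) {
        // Use a local master so the check runs even when the standalone
        // cluster rejects the RPC with InvalidClassException.
        SparkConf conf = new SparkConf().setAppName("versionCheck").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Spark version bundled with the application (from the build dependencies).
        System.out.println("Spark on the driver classpath: " + sc.version());
        // Scala library version on the classpath; the Scala object is reached
        // from Java through its MODULE$ singleton field.
        System.out.println("Scala on the driver classpath: "
                + scala.util.Properties$.MODULE$.versionString());
        sc.stop();
    }
}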
Hope this helps!

Got it all working with the combination of versions below.
Installed Spark 1.6.2.
Verify with bin/spark-submit --version.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
and
Scala 2.10.6 and Java 8.
Note that it did NOT work, and I had a similar class-incompatibility issue, with the versions below:
Scala 2.11.8 and Java 8
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>1.6.2</version>
</dependency>

It looks like your installed Spark version is not the same as the Spark version used in your IDE.
If you are using Maven, just compare the version of the dependency declared in pom.xml with the output of bin/spark-submit --version and make sure they are the same.
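A sketch of one way to make that comparison easy is to drive both the Scala artifact suffix and the Spark version from Maven properties, so they live in a single place (the property names here are just illustrative, using the 1.6.2 / 2.10 combination from the answer above):

<properties>
    <!-- keep both values in sync with the output of bin/spark-submit --version -->
    <spark.version>1.6.2</spark.version>
    <scala.binary.version>2.10</scala.binary.version>
</properties>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
</dependency>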

I faced this issue because my Spark jar dependency was 2.1.0 while the installed Spark engine was 2.0.0, hence the version mismatch and this exception.
The root cause is a version mismatch between the Spark jar dependency in the project and the installed Spark engine that the job runs on.
So verify both versions and make them identical.
For example, with a spark-core jar at version 2.1.0, the Spark computation engine version must be 2.1.0; with a spark-core jar at 2.0.0, the engine must be 2.0.0.
It works perfectly for me.

I had this problem.
When I ran the code with spark-submit (instead of running it from the IDE), it worked:
./bin/spark-submit --master spark://HOST:PORT target/APP-NAME.jar
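For completeness, a typical flow with a Maven build looks roughly like this (the --class value is a placeholder for your own main class; omit it if your jar's manifest already sets Main-Class):

mvn package
./bin/spark-submit --master spark://HOST:PORT --class com.example.YourMainClass target/APP-NAME.jar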

Related

Compatible version of Scala for Spark 2.4.2 & EMR 5.24.1

What Scala version should I use to compile/build with Spark 2.4.2? I tried with Scala 2.12 and got the error message below.
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
Any input is really appreciated.
Upon checking, 2.11.12 looks to be the Scala version on EMR.
It worked for us.
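For reference, a minimal build.sbt sketch matching that combination (marking Spark as provided is my assumption, on the usual premise that the EMR cluster supplies the Spark jars at runtime):

scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2" % "provided"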

Spark netty Version Mismatch on HDInsight Cluster

I am currently having an issue when running my Spark job remotely in an HDInsight cluster:
My project has a dependency on netty-all and here is what I explicitly specify for it in the pom file:
<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-all</artifactId>
    <version>4.1.51.Final</version>
</dependency>
The final built jar includes this package with the specified version and running the Spark job on my local machine works fine. However, when I try to run it in the remote HDInsight cluster, the job throws the following exception:
java.lang.NoSuchMethodError: io.netty.handler.ssl.SslProvider.isAlpnSupported(Lio/netty/handler/ssl/SslProvider;)Z
I believe this is due to a netty version mismatch: Spark was picking up the old netty version (netty-all-4.1.17) from its default system classpath on the remote cluster rather than the newer netty package bundled in the uber jar.
I have tried different ways to resolve this issue but they don't seem to work well:
Relocating classes using Maven Shade plugin:
More details and its issues are here - Missing Abstract Class using Maven Shade Plugin for Relocating Classes
Spark configurations
spark.driver.extraClassPath=<path to netty-all-4.1.50.Final.jar>
spark.executor.extraClassPath=<path to netty-all-4.1.50.Final.jar>
I would like to know if there are any other solutions to this issue, or whether there are any steps missing here.
You will need to ensure you only have Netty 4.1.50.Final or higher on the classpath.
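If swapping the jar on the cluster is not an option, another avenue (not confirmed in this thread, and marked experimental in the Spark configuration docs) is to ask Spark to prefer the classes from your uber jar over its own classpath:

spark.driver.userClassPathFirst=true
spark.executor.userClassPathFirst=true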

Which Scala version does Spark 2.4.3 use?

I installed Scala(Version 2.12.8) and Spark(2.4.3) on my Mac OS from homebrew. I already have Java 1.8 installed on my machine.
When I launch spark-shell, I see the logo says:
Spark version 2.4.3, Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Why does it say Scala version 2.11.12 instead of the Scala version 2.12.8 that is installed on my machine?
Does Spark 2.4.3 come with Scala 2.11.12?
Thanks.
As stated in the release notes:
Spark 2.4.3 is a maintenance release containing stability fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release.
Note that 2.4.3 switched the default Scala version from Scala 2.12 to Scala 2.11, which is the default for all the previous 2.x releases except 2.4.2. That means, the pre-built convenience binaries are compiled for Scala 2.11. Spark is still cross-published for 2.11 and 2.12 in Maven Central, and can be built for 2.12 from source.
Additionally, the Scala version you happen to have on your machine is completely irrelevant: Spark uses the Scala version it was compiled with.
Once we start writing Spark code, we need to import spark-core and spark-sql into the project. If the right versions are not used, compilation or runtime fails with missing definitions.
To choose the right versions of the Spark and Scala libraries:
Check the installed Spark version by running spark-shell; it shows both the Spark and Scala versions. Use exactly those versions when importing them into your project.
For example, in SBT, Spark 2.4.5 supports Scala 2.11.12:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
scalaVersion := "2.11.12"

java.lang.NoSuchMethodError when loading external FAT-JARs in Zeppelin

While trying to run a piece of code that used some FAT JARs (that share some common submodules) built using sbt assembly, I'm running into this nasty java.lang.NoSuchMethodError.
The JAR is built on EMR itself (not uploaded from some other environment), so a version conflict in libraries / Spark / Scala etc. is unlikely.
My EMR environment:
Release label: emr-5.11.0
Hadoop distribution: Amazon 2.7.3
Applications: Spark 2.2.1, Zeppelin 0.7.3, Ganglia 3.7.2, Hive 2.3.2, Livy 0.4.0, Sqoop 1.4.6, Presto 0.187
Project configurations:
Scala 2.11.11
Spark 2.2.1
SBT 1.0.3
It turned out that the real culprits were the shared submodules in those jars.
Two fat jars built out of projects containing common submodules were leading to this conflict. Removing one of those jars resolved the issue.
I'm not sure if this conflict happens only under particular circumstances or would always occur upon uploading such jars (with the same submodules) to the Zeppelin interpreter, so I'm still waiting for a proper explanation.

Spark2.2 interpreter on Zeppelin 0.7.0 running on HDP 2.6

I'm simply trying to test a Zeppelin interpreter to run Spark 2.2 on YARN on Zeppelin 0.7.0 (HDP 2.6), but I repeatedly get:
java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
All I am running is
%spark2
sc.version
With the same Spark 2.2 I can run spark-submit and spark-shell operations on YARN (locally and remotely), but I can't make Zeppelin pick up this new version of Spark. Does Zeppelin on HDP only support Spark 2.1 and 1.6? (My Spark 2.2 is a custom installation.)
The only thing that makes me believe the above is that I can see this in the logs when testing the Zeppelin notebook:
Added JAR file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.10-0.7.0.2.6.0.3-8.jar
which appears to be an HDP-specific Zeppelin JAR.
Please help.
Yes, you are right. I was hitting a similar issue while running Zeppelin 0.7.0 and Spark 2.2.0 on Mesos. In fact, have a look at this commit:
https://github.com/apache/zeppelin/commit/28310c2b95785d8b9e63bc0adc5a26df8b3c9dec
The support seems to have been added in 0.7.3, so try upgrading Zeppelin. I built Zeppelin from the master branch and it worked for me, but the tag v0.7.3 should work fine as well.
