java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible

After locally compiling Spark v3.2.1, we deployed it on Kubernetes. We observe the stack trace below in the Spark master. Any clues to resolve this exception would be helpful.
JDK - openjdk8 (1.8)
Scala - 2.12.15
Hadoop - 3.3.1
2022-06-23 05:27:25.847 GMT ERROR [SPARK_MASTER] TransportRequestHandler: [rpc-server-4-1] Error while invoking RpcHandler#receive() for one-way message. java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible: stream classdesc serialVersionUID = 6543101073799644159, local class serialVersionUID = 1574364215946805297

We found that this exception occurs due to a Scala version incompatibility. Rebuilding after cleaning up the old Spark image on the build machine resolved the issue.
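For context, when a Serializable class does not declare serialVersionUID, the JVM derives one from the compiled class structure (fields, methods, implemented interfaces), so the same source built against different Scala versions can end up with different UIDs. A minimal sketch of the mechanism (the class name is illustrative, not Spark's actual code):

import java.io.Serializable;

// Illustrative only. Without an explicit serialVersionUID, the JVM computes
// one from the compiled class structure, so two builds of the "same" class
// (e.g. compiled by different Scala versions) can disagree, and
// deserialization then fails with java.io.InvalidClassException.
public class AppDescriptionSketch implements Serializable {
    // Pinning the UID keeps serialized forms compatible across rebuilds:
    private static final long serialVersionUID = 1L;

    private final String name;

    public AppDescriptionSketch(String name) {
        this.name = name;
    }
}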

Related

org.apache.spark.SparkException: Writing job aborted on Databricks

I am using Databricks to ingest data from Event Hubs and process it in real time with PySpark Streaming. The code works fine, but after this line:
df.writeStream.trigger(processingTime='100 seconds').queryName("myquery") \
    .format("console").outputMode('complete').start()
I'm getting the following error:
org.apache.spark.SparkException: Writing job aborted.
Caused by: java.io.InvalidClassException: org.apache.spark.eventhubs.rdd.EventHubsRDD; local class incompatible: stream classdesc
I have read that this could be due to low processing power, but I am using a Standard_F4 machine, standard cluster mode with autoscaling enabled.
Any ideas?
This looks like a JAR issue. Go to your JARs folder in Spark and check whether you have multiple JARs for azure-eventhubs-spark_XXX.XX. You may have downloaded different versions of it and placed them there; remove any duplicate JARs with that name. This error can also occur when your JAR version is incompatible with other JARs. Try adding the Spark JARs using the Spark config:
spark = SparkSession \
    .builder \
    .appName('my-spark') \
    .config('spark.jars.packages', 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12') \
    .getOrCreate()
This way Spark downloads the JAR files (and their matching transitive dependencies) through Maven.
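One caveat, from general Databricks behaviour rather than this answer: in a Databricks notebook a SparkSession usually already exists, so getOrCreate() may return it without applying spark.jars.packages; in that case attach the package through the cluster's library settings (Maven coordinates) instead.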

Apache Beam Issue with Spark Runner while using Kafka IO

I am trying to test KafkaIO for Apache Beam code with the Spark runner.
The code works fine with the direct runner.
However, if I add the line below, it throws an error:
options.setRunner(SparkRunner.class);
Error:
ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 2.0 (TID 0)
java.lang.StackOverflowError
at java.base/java.io.ObjectInputStream$BlockDataInputStream.readByte(ObjectInputStream.java:3307)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2135)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:482)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:440)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
at jdk.internal.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
Versions that I am trying to use:
<beam.version>2.33.0</beam.version>
<spark.version>3.1.2</spark.version>
<kafka.version>3.0.0</kafka.version>
This issue was resolved by adding the VM argument -Xss2M (a 2 MB thread stack size).
This link helped me solve it:
https://github.com/eclipse-openj9/openj9/issues/10370
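If you would rather keep the fix in the job configuration than on the bare JVM command line, one option (a sketch, assuming a plain Spark setup rather than your exact Beam wiring) is Spark's extraJavaOptions settings:

import org.apache.spark.SparkConf;

public class StackSizeSketch {
    public static void main(String[] args) {
        // Sketch: raise the executor thread stack size via Spark config.
        // The driver JVM is already running by the time SparkConf is read,
        // so the driver-side flag normally goes on the spark-submit command
        // line (--driver-java-options "-Xss2M") or into spark-defaults.conf.
        SparkConf conf = new SparkConf()
            .set("spark.executor.extraJavaOptions", "-Xss2M");
        System.out.println(conf.toDebugString());
    }
}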

How to spark-submit remotely to EMR in client mode?

I have an ECS task configured to run spark-submit against an EMR cluster. The spark-submit is configured in YARN cluster mode.
My streaming application is supposed to save data from an RDD to Redshift, but I'm getting this error:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at com.databricks.spark.redshift.Utils$.assertThatFileSystemIsNotS3BlockFileSystem(Utils.scala:162)
at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:386)
at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:108)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
...
I suspect that because "spark.yarn.jars" was not set, it pushed my remote server's $SPARK_HOME libraries over, and those are missing the JARs for com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
So I also tried setting "spark.yarn.jars=hdfs://nodename:8020/user/spark/jars/*.jar" after copying over the EMR master node's /usr/lib/spark/jars/*. Then it errors:
java.io.InvalidClassException: org.apache.spark.sql.execution.SparkPlan; local class incompatible: stream classdesc serialVersionUID = -7931627949087445875, local class serialVersionUID = -5425351703039338847
I think there may be a mismatch between the remote client's JARs and the EMR cluster's JARs, but both are version 2.4.7.
Does anyone have a clever solution to get my streaming spark-submit job working against EMR in YARN client mode?
The binaries need to be the same as those in the EMR cluster.
This resource helped me resolve this issue:
https://docs.dominodatalab.com/en/4.5.2/reference/spark/external_spark/Connecting_to_an_Amazon_EMR_cluster_from_Domino.html
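As a concrete illustration of the approach above (a sketch; the namenode host, port, and path are placeholders for your own layout, and the JARs at that path must be exact copies from the EMR master's /usr/lib/spark/jars):

import org.apache.spark.SparkConf;

public class EmrClientModeSketch {
    public static void main(String[] args) {
        // Sketch: point the remote client at the jars copied from the EMR
        // master so driver and executors deserialize identical class versions.
        SparkConf conf = new SparkConf()
            .set("spark.master", "yarn")
            .set("spark.submit.deployMode", "client")
            .set("spark.yarn.jars", "hdfs://namenode:8020/user/spark/jars/*.jar");
        System.out.println(conf.toDebugString());
    }
}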

Why do Spark jobs fail on Zeppelin while they work in the pyspark shell?

I am trying to execute the following code on Zeppelin:
df = spark.read.csv('/path/to/csv')
df.show(3)
but I get the following error:
Py4JJavaError: An error occurred while calling o786.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 39.0 failed 4 times, most recent failure: Lost task 5.3 in stage 39.0 (TID 326, 172.16.23.92, executor 0): java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 3
I have Hadoop 2.7.3 running on a 2-node cluster, Spark 2.3.2 running in standalone mode, and Zeppelin 0.8.1. This problem only occurs when using Zeppelin,
and I have SPARK_HOME set in the Zeppelin configuration.
I solved it. The problem was that Zeppelin was using commons-lang3-3.5.jar while Spark was using commons-lang-2.6.jar, so all I did was add the JAR path to the Zeppelin configuration in the Interpreter menu:
1. Click the 'Interpreter' menu in the navigation bar.
2. Click the 'edit' button of the interpreter you want to load dependencies into.
3. Fill in the artifact and exclude fields to your needs; add the path to the respective JAR file.
4. Press 'Save' to restart the interpreter with the loaded libraries.
Zeppelin is using its commons-lang 2 JAR to stream to the Spark executors while local Spark is using commons-lang 3. As Achref mentioned, just fill in the artifact location of commons-lang3 (for example, the Maven coordinate org.apache.commons:commons-lang3:3.5) and restart the interpreter; then you should be good.

An Apache Beam pipeline on Azure HDInsight's SparkRunner

I am trying to get a Beam pipeline to run on Azure's HDInsight SparkRunner.
I tried first with a cluster based on Spark 2.3.0/Hadoop 2.7 (HDI 3.6) and then also 2.3.1/Hadoop 3.0 (HDI 4.0 Preview).
I tried using Apache Beam 2.2.0 and next 2.10.0-SNAPSHOT.
The spark-submit command is (for Beam 2.10.0):
JARS="wasbs:///dependency/hadoop-azure-3.1.1.3.0.2.0-50.jar,wasbs:///dependency/azure-storage-7.0.0.jar,wasbs:///dependency/beam-model-fn-execution-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-model-job-management-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-model-pipeline-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-core-construction-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-core-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-direct-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-spark-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-core-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-fn-execution-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-io-hadoop-file-system-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-vendor-grpc-1_13_1-0.1.jar"
spark-submit --conf spark.yarn.maxAppAttempts=1 --deploy-mode cluster --master yarn --jars $JARS --class example.MinimalWordCountJava8 wasbs:///mavenproject1-1.0-SNAPSHOT.jar --runner=SparkRunner
(Initially --jars was not given the hadoop-azure and azure-storage JARs, but that did not make any difference.)
The main() looks like this:
public static void main(String[] args) {
    JavaSparkContext ct = new JavaSparkContext();
    Configuration config = ct.hadoopConfiguration();
    config.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
    config.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
    config.set("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasb");
    config.set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.Wasbs");
    config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
    config.set("fs.azure.account.key." + account + ".blob.core.windows.net", key);
    config.set("fs.defaultFS", "wasb://" + container + "@" + account + ".blob.core.windows.net");

    System.out.println("### hello.txt content:");
    JavaRDD<String> content = ct.textFile("wasbs:///hello.txt");
    System.out.println(content.toString());

    System.out.println("### MinimalWordCountJava8");
    PipelineOptions options = PipelineOptionsFactory.create();
    SparkContextOptions sparkContextOptions = options.as(SparkContextOptions.class);
    sparkContextOptions.setUsesProvidedSparkContext(true);
    sparkContextOptions.setProvidedSparkContext(ct);
    sparkContextOptions.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(sparkContextOptions);
    p.apply(TextIO.read().from("hello.txt"))
        .apply(FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
        .apply(Filter.by((String word) -> !word.isEmpty()))
        .apply(Count.<String>perElement())
        .apply(MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()))
        .apply(TextIO.write().to("output"));
    p.run().waitUntilFinish();
}
It fails when calling Pipeline.create(sparkContextOptions) with this exception trace:
18/12/09 14:47:10 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml
java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml
at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:59)
at org.apache.beam.sdk.io.FileSystems.verifySchemesAreUnique(FileSystems.java:489)
at org.apache.beam.sdk.io.FileSystems.setDefaultPipelineOptions(FileSystems.java:479)
at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:47)
at org.apache.beam.sdk.Pipeline.create(Pipeline.java:145)
at io.aptly.mavenproject1.MinimalWordCountJava8.main(MinimalWordCountJava8.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "wasbs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3332)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:3377)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:530)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:542)
at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.<init>(HadoopFileSystem.java:82)
at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:56)
... 10 more
18/12/09 14:47:10 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml [same stack trace as above])
The submit works (the wasbs:// scheme is recognised) and reading the small wasbs:///hello.txt does not fail, which indicates that wasbs:// works up to that point.
It is early inside Beam that it seems to fail.
Because of this I passed the JavaSparkContext in via the PipelineOptions (with the dynamic Hadoop configuration suggested by other SO questions/answers), but that did not make a difference for me.
Can anyone guide me on how to get around this issue?
From quickly digging through the code and bug trackers, it looks like Azure is supported as a Hadoop filesystem starting with Hadoop 3.2.0 (code, Jira), while Beam is currently pinned to Hadoop 2.7.3. This would explain the failure in Beam's HadoopFileSystem.
spark-submit may have succeeded because wasbs:// is supported there via a mechanism other than Hadoop's libraries, or through a bundled, newer version of Hadoop.
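If you do want Beam to see your Azure-aware Hadoop Configuration explicitly, the hook its registrar reads is HadoopFileSystemOptions. A sketch of that handoff (it only helps if the Hadoop JARs on the classpath can actually resolve the "wasbs" scheme, which per the above requires a newer Hadoop):

import java.util.Collections;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class BeamWasbsSketch {
    public static void main(String[] args) {
        // Sketch: hand the wasbs-aware Configuration to Beam so that
        // HadoopFileSystemRegistrar.fromOptions() uses it instead of only
        // the cluster's core-site.xml/hdfs-site.xml defaults.
        Configuration config = new Configuration();
        config.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        HadoopFileSystemOptions options =
            PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
        options.setHdfsConfiguration(Collections.singletonList(config));
        // Pass these options (via .as(...)) into Pipeline.create(...).
    }
}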
