Why Spark jobs don't work on Zeppelin while they work when using the pyspark shell - apache-spark

I'm trying to execute the following code on Zeppelin:
df = spark.read.csv('/path/to/csv')
df.show(3)
but I get the following error:
Py4JJavaError: An error occurred while calling o786.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 39.0 failed 4 times, most recent failure: Lost task 5.3 in stage 39.0 (TID 326, 172.16.23.92, executor 0): java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 3
I have Hadoop 2.7.3 running on a 2-node cluster, Spark 2.3.2 in standalone mode, and Zeppelin 0.8.1. This problem only occurs when using Zeppelin, and I have SPARK_HOME set in the Zeppelin configuration.

I solved it. The problem was that Zeppelin was using commons-lang3-3.5.jar while Spark was using commons-lang-2.6.jar, so all I did was add the jar path to the Zeppelin configuration in the interpreter menu:
1. Click the 'Interpreter' menu in the navigation bar.
2. Click the 'edit' button of the interpreter you want to load dependencies into.
3. Fill in the artifact and exclude fields as needed; add the path to the respective jar file.
4. Press 'Save' to restart the interpreter with the loaded libraries.

Zeppelin is using its commons-lang2 jar when streaming to the Spark executors while the local Spark installation is using commons-lang3. As Achref mentioned, just fill in the artifact location of commons-lang3 and restart the interpreter; then you should be good.
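If you would rather pull the dependency by its Maven coordinates than point at a jar on disk, the same idea can be expressed through spark.jars.packages. This is only a sketch, assuming commons-lang3 3.5 is the version your executors expect; in Zeppelin the coordinate would go into the interpreter's dependency settings described above rather than into notebook code.

# Sketch, not from the original answer: pull commons-lang3 from Maven Central so the
# driver and executors deserialize classes with the same serialVersionUID.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("commons-lang3-fix")
    .config("spark.jars.packages", "org.apache.commons:commons-lang3:3.5")  # assumed version
    .getOrCreate()
)

df = spark.read.csv('/path/to/csv')
df.show(3)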

Related

org.apache.spark.SparkException: Writing job aborted on Databricks

I have used Databricks to ingest data from Event Hub and process it in real time with PySpark Streaming. The code works fine, but after this line:
df.writeStream.trigger(processingTime='100 seconds').queryName("myquery")\
.format("console").outputMode('complete').start()
I'm getting the following error:
org.apache.spark.SparkException: Writing job aborted.
Caused by: java.io.InvalidClassException: org.apache.spark.eventhubs.rdd.EventHubsRDD; local class incompatible: stream classdesc
I have read that this could be due to low processing power, but I am using a Standard_F4 machine, standard cluster mode with autoscaling enabled.
Any ideas?
This looks like a JAR issue. Go to your JARs folder in Spark and check whether you have multiple jars for azure-eventhubs-spark_XXX.XX. You have probably downloaded different versions of it and placed them there; remove any JAR with that name from your collection. This error may also occur if your JAR version is incompatible with other JARs. Try adding the Spark jars using the Spark config:
spark = SparkSession \
    .builder \
    .appName('my-spark') \
    .config('spark.jars.packages', 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12') \
    .getOrCreate()
This way Spark will download the JAR files through Maven.
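Before removing anything, it can help to confirm that duplicates really exist. Here is a small sketch (a hypothetical helper; it assumes SPARK_HOME points at your Spark installation, and on Databricks the connector jars may live elsewhere) that lists the candidate jars:

# Hypothetical helper, not part of the original answer: list every Event Hubs
# connector jar in Spark's jars directory so duplicate versions are easy to spot.
import glob
import os

jars_dir = os.path.join(os.environ["SPARK_HOME"], "jars")
for jar in sorted(glob.glob(os.path.join(jars_dir, "azure-eventhubs-spark*.jar"))):
    print(jar)

If more than one version shows up, keep only the one that matches your Spark and Scala build.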

Apache Beam Issue with Spark Runner while using Kafka IO

I am trying to test KafkaIO for the Apache Beam Code with a Spark Runner.
The code works fine with a Direct Runner.
However, if I add the code line below, it throws an error:
options.setRunner(SparkRunner.class);
Error:
ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 2.0 (TID 0)
java.lang.StackOverflowError
at java.base/java.io.ObjectInputStream$BlockDataInputStream.readByte(ObjectInputStream.java:3307)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2135)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:482)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:440)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
at jdk.internal.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
Versions that I am trying to use:
<beam.version>2.33.0</beam.version>
<spark.version>3.1.2</spark.version>
<kafka.version>3.0.0</kafka.version>
This issue was resolved by adding the VM argument -Xss2M.
This link helped me solve the issue:
https://github.com/eclipse-openj9/openj9/issues/10370
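If the job runs on a real cluster rather than an embedded local runner, the larger stack size also has to reach the executor JVMs. The following is only a sketch of that idea, shown with pyspark configuration keys for illustration (the code in the question is Java, and how the option is passed ultimately depends on how the job is submitted):

# Sketch under assumptions: spark.executor.extraJavaOptions is a standard Spark
# setting for executor JVM flags; the driver-side -Xss2M still has to be supplied
# when the driver JVM is launched (e.g. as a VM argument or via spark-submit).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafkaio-spark-runner-test")
    .config("spark.executor.extraJavaOptions", "-Xss2M")  # larger thread stack on executors
    .getOrCreate()
)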

Spark-shell returning the error: SparkContext: Error initializing SparkContext/Utils: Uncaught exception in thread main

I tried installing Spark on Windows 10. I followed these steps in this order:
Installed Java (outside the Program Files folder on the C drive)
Validated the version of Spark downloaded from Apache (spark-3.2.0-bin-hadoop3.2.tgz)
Unzipped Spark into a folder outside Program Files (on the C drive)
Downloaded winutils.exe (from the GitHub repo, in the folder Hadoop-3.2.0/bin) and put it in the c:/hadoop/bin folder
Set the environment variables JAVA_HOME (path of Java), SPARK_HOME (path of the Spark installation), and HADOOP_HOME (path of winutils)
Added %JAVA_HOME%/bin to the PATH variable, and similarly for the other two
When I tried running Spark -version, it gives the error 'Spark is not recognized as an internal or external command'. When I run spark-shell, it gives the error SparkContext: Error initializing SparkContext / Utils: Uncaught exception in thread main / ERROR Main: Failed to initialize Spark session.
Could you please let me know if I missed any steps for successful execution? Any suggestions on how to resolve these errors while running Spark?
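Since 'not recognized' usually means the relevant bin directories never made it onto PATH, a quick sanity check can confirm what the steps above actually set. This is a hypothetical helper, not from the original post:

# Hypothetical sanity check: print the environment variables from the steps above and
# verify that the executables they should expose are reachable on PATH.
import os
import shutil

for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))

# spark-shell is only found if %SPARK_HOME%\bin is on PATH;
# winutils is only found if %HADOOP_HOME%\bin is on PATH.
print("spark-shell:", shutil.which("spark-shell"))
print("winutils:", shutil.which("winutils"))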

apache_beam spark runner with python can't be implemented on remote spark cluster?

I am following the Python guide for the Beam Spark runner. The Beam pipeline can submit a job to a local job server launched by ./gradlew :runners:spark:job-server:runShadow with a local Spark,
or, with the additional parameter -PsparkMasterUrl=spark://localhost:7077, to a pre-deployed Spark.
But I have a Spark cluster on YARN. I set the launch command to ./gradlew :runners:spark:job-server:runShadow -PsparkMasterUrl=yarn (I also tried yarn-client), but I only get org.apache.spark.SparkException: Could not parse Master URL: 'yarn',
and the source code of the Spark runner (beam\sdks\python\apache_beam\runners\portability\spark_runner.py) shows that:
parser.add_argument('--spark_master_url',
                    default='local[4]',
                    help='Spark master URL (spark://HOST:PORT). '
                         'Use "local" (single-threaded) or "local[*]" '
                         '(multi-threaded) to start a local cluster for '
                         'the execution.')
It doesn't mention 'yarn', and a provided SparkContext and StreamingListeners are not supported on the Spark portable runner. So does that mean the apache_beam Spark runner with Python can't run against a remote Spark cluster (mostly YARN) and can only be tested locally? Or maybe I can set the job_endpoint to the remote job server URL of my Spark cluster?
Also, every ./gradlew command blocks at 98%, but the job server starts with info like this:
19/11/28 13:47:48 INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver: JobService started on localhost:8099
<============-> 98% EXECUTING [16s]
> IDLE
> :runners:spark:job-server:runShadow
> IDLE
So does that mean the apache_beam Spark runner with Python can't run against a remote Spark cluster (mostly YARN)?
We've recently added portable Spark jars, which can be submitted via spark-submit. This feature isn't scheduled to be included in a Beam release until 2.19.0, however.
I created a JIRA ticket to track the status of YARN support, in case there are other related issues that need to be addressed.
And every ./gradlew command blocks at 98%?
That's expected behavior. The job server will stay running until canceled.
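To try the job_endpoint route mentioned in the question, a minimal sketch (assuming the job server started above is reachable on localhost:8099, as its log line shows) would submit the pipeline through the portable runner:

# Sketch, not from the original thread: submit a trivial Beam Python pipeline to the
# running Spark job server through the portable runner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",  # address printed by the job server
    "--environment_type=LOOPBACK",    # run the Python SDK harness in this process
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create(["hello", "beam"])
     | beam.Map(print))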

spark error: in state DEFINE instead of RUNNING

I'm using spark-shell to run a Spark HBase script.
When I run this command:
val job = Job.getInstance(conf)
I get this error:
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
The error is due to running this in spark-shell. Please use spark-submit instead; this should solve your problem.