Spark submit fails calling setEntityResolver of XMLConfiguration in Apache Commons Configuration - apache-spark

I have a problem when I try to submit my application with the spark-submit command:
/bin/spark-submit --class MyClass myjar.jar
I set the master URL programmatically.
I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.configuration.XMLConfiguration.setEntityResolver(Lorg/xml/sax/EntityResolver;)V
When I run my program from my IDE, everything works correctly and this problem does not arise.

It looks like the jar you are submitting may not contain all of the dependencies it requires. The solution is to build an assembly jar; see https://maven.apache.org/plugins/maven-assembly-plugin/usage.html (for Maven) or https://github.com/sbt/sbt-assembly (for sbt).
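For the sbt route, a minimal sketch might look like the following; the plugin and Spark versions are examples, not part of the original answer.
// project/plugins.sbt -- pull in the sbt-assembly plugin (example version)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
// build.sbt -- mark Spark itself as "provided" so the assembly bundles only your
// own dependencies; then build the fat jar with `sbt assembly` and pass the
// resulting target/scala-2.10/<name>-assembly-<version>.jar to spark-submit
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"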

I have finally found the cause of the problem.
Spark uses hadoop-client 2.2.0, which depends on hadoop-common, which in turn depends on commons-configuration 1.6.
In my application I used commons-configuration 1.10, where XMLConfiguration.setEntityResolver is implemented; in version 1.6 of the library that method does not exist.
When I run:
/bin/spark-submit --class MyClass myjar.jar
the XMLConfiguration class from commons-configuration 1.6 is loaded, and the JVM cannot find the setEntityResolver method.
I resolved it by using commons-configuration 2.0-beta1 in my application.
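In build-file terms, the switch amounts to moving to the 2.x line, which (to the best of my knowledge) lives under different Maven coordinates and the org.apache.commons.configuration2 package, so its classes can no longer be shadowed by the 1.6 jar that hadoop-common pulls in:
// build.sbt -- old dependency that clashed with the 1.6 version shipped via hadoop-common
// libraryDependencies += "commons-configuration" % "commons-configuration" % "1.10"
// 2.x lives under new coordinates and a new package, so the 1.6 jar on Spark's
// classpath cannot shadow it
libraryDependencies += "org.apache.commons" % "commons-configuration2" % "2.0-beta1"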

Related

Running Spark 2.3 on Kubernetes with a remote dependency on S3

I am using spark-submit to run on Kubernetes (Spark 2.3). My problem is that the InitContainer does not download my jar file if it is specified as an s3a:// path, but it does work if I put my jar on an HTTP server and use http://. The Spark driver fails, of course, because it can't find my class (and the jar file is in fact not in the image).
I have tried two approaches:
specifying the s3a path to the jar as the argument to spark-submit, and
using --jars to specify the jar file's location on s3a; both fail in the same way.
Edit: using local:///home/myuser/app.jar also fails with the same symptoms.
On a failed run (dependency on s3a), I logged into the container and found the directory /var/spark-data/spark-jars/ to be empty. The init-container logs don't indicate any type of error.
Questions:
What is the correct way to specify remote dependencies on S3A?
Is S3A not supported yet? Only http(s)?
Any suggestions on how to further debug the InitContainer to determine why the download doesn't happen?

Apache Spark not recognising import from external Jar

I am trying to adapt some code from Apache Zeppelin for a personal project. The idea is to pass Scala source code to be executed in Spark. Everything works fine until I try to use an external jar. For this, I call
SparkConf#setJars(externalJars);
And I can see in the logs that my jar was added:
Added JAR file:/Users/.../lsa.jar at spark://192.168.0.16:60376/jars/lsa.jar with timestamp 1470532825125
And when I check the Spark UI at http://192.168.0.16:4040/environment/, I can see that my jar was added, with an entry under Classpath Entries:
spark://192.168.0.16:60376/jars/lsa.jar
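Spelled out, the setup described above boils down to roughly this (the jar path, app name, and master are placeholders I filled in):
import org.apache.spark.{SparkConf, SparkContext}
// Placeholder reconstruction of the setup described above: register the external
// jar on the SparkConf before the SparkContext is created so it is shipped to executors.
val externalJars = Seq("/Users/me/lsa.jar")        // placeholder path
val conf = new SparkConf()
  .setAppName("code-runner")                       // placeholder app name
  .setMaster("local[*]")                           // placeholder master
  .setJars(externalJars)
val sc = new SparkContext(conf)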
But when I try to import a class from the JAR I get:
<console>:25: error: object cloudera is not a member of package com
import com.cloudera.datascience.lsa._
^
Does anyone have an idea what I am missing?
Edit: I also tried to add the JAR via the spark-defaults.conf:
spark.driver.extraClassPath /Users/.../lsa.jar
but no luck.
I can see here that the doc says:
Instead, please set this through the --driver-class-path command line option or in your default properties file.
I don't know where to pass this option; should I do it for the master only or for each slave?
Thanks in advance

Error while running Zeppelin paragraphs in Spark on Linux cluster in Azure HDInsight

I have been following this tutorial in order to set up Zeppelin on a Spark cluster (version 1.5.2) in HDInsight, on Linux. Everything worked fine; I managed to successfully connect to the Zeppelin notebook through the SSH tunnel. However, when I try to run any kind of paragraph, the first time I get the following error:
java.io.IOException: No FileSystem for scheme: wasb
After getting this error, if I try to rerun the paragraph, I get another error:
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
These errors occur regardless of the code I enter, even if there is no reference to HDFS. In other words, I get the "No FileSystem" error even for a trivial Scala expression, such as parallelize.
Is there a missing configuration step?
As I type this, I am downloading the tarball from the script you pointed to. What I am guessing, though, is that your Zeppelin and Spark installs are not set up to work with wasb. To get Spark to work with wasb, you need to add some jars to the classpath. To do this, add something like the following to your spark-defaults.conf (the paths might be different in HDInsight; this is from HDP on IaaS):
spark.driver.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
spark.executor.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
Once you have Spark working with wasb, the next step is to put those same jars on the Zeppelin classpath. A good way to test your setup is to make a notebook that prints your environment variables and classpath:
sys.env.foreach(println(_))
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
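To narrow that down, a paragraph along these lines can check specifically whether the Azure jars made it onto the classpath (the "azure" name filter is only an assumption about how those jars are named):
// Print just the classpath entries whose names mention "azure", e.g. the jars
// added through spark.driver.extraClassPath above.
ClassLoader.getSystemClassLoader
  .asInstanceOf[java.net.URLClassLoader]
  .getURLs
  .map(_.toString)
  .filter(_.contains("azure"))
  .foreach(println)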
Also, looking at the install script, it tries to pull the Zeppelin jar from wasb; you might want to point that config somewhere else while you try some of these changes out (zeppelin.sh):
export SPARK_YARN_JAR=wasb:///apps/zeppelin/zeppelin-spark-0.5.5-SNAPSHOT.jar
I hope this helps. If you still have problems I have some other ideas, but I would start with these first.

How to run an Apache Spark Java program in standalone mode

I have written a Java program for Spark, but I am not able to run it from the command line.
I have followed the steps given in the Quick Start guide, but I am getting the following error. Please help me out with this problem.
Here is the error:
hadoopnod#hadoopnod:~/spark-1.2.1/bin$ ./run-example "SimpleApp " --master local /home/hadoopnod/Spark_Java/target/simple-project-1.0.jar
java.lang.ClassNotFoundException: org.apache.spark.examples.SimpleApp
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:342)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Create a JAR file using the following command. You can find the SimpleApp.class file in the "target/classes" folder; cd to that directory.
jar cfve file.jar SimpleApp SimpleApp.class
Put this JAR file into your project's target directory. This is the JAR, containing your compiled SimpleApp class, that you will submit to Spark.
cd to your Spark directory. I am using spark-1.4.0-bin-hadoop2.6, so the prompt looks like this:
spark-1.4.0-bin-hadoop2.6>
Submit your Spark program using spark-submit. If you have a package structure like Harsha has explained in another answer, then provide
--class org.apache.spark.examples.SimpleApp
else
--class SimpleApp
Finally, submit your Spark program:
spark-1.4.0-bin-hadoop2.6>./bin/spark-submit --class SimpleApp --master local[2] /home/hadoopnod/Spark_Java/target/file.jar
The ./bin/run-example script is used to execute the examples included in the distribution. To run the "SparkPi" example, do this:
> cd /apps/spark-1.2.0
> ./bin/run-example SparkPi
If you look at how this script executes, it is just a user-friendly wrapper which actually calls spark-submit.
Here's an example that executes the same "SparkPi" example from above, but using spark-submit:
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/target/spark-examples_2.10-1.2.0.jar
You should use spark-submit to run your own code.
ClassNotFoundException: org.apache.spark.examples.SimpleApp
From the above error, it is clear that it cannot find the class you are trying to execute. Have you bundled your Java project into a jar file? If you have any other dependencies, you need to include them in the jar file as well.
Assume you have a project structure like this:
simpleapp
  - src/main/java
    - org.apache.spark.examples
      - SimpleApp.java
  - lib
    - dependent.jars (you can put all dependent jars inside lib directory)
  - target
    - simpleapp.jar (after compiling your source)
You can use any build tool or IDE to bundle your source into a jar file. After that, if you have added the spark/bin directory to your path, you can execute the command below from your project directory. You need to add --jars $(echo lib/*.jar | tr ' ' ',') only if you have dependent libraries for your SimpleApp.java:
spark-submit --jars $(echo lib/*.jar | tr ' ' ',' ) --class org.apache.spark.examples.SimpleApp --master local[2] target/simpleapp.jar
I had the same issue. If you want to use the command provided by the Spark Quick Start, be sure your project has the same layout:
find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
That may not be the case for you, but my pom.xml built my layout like
./src/main/java/myGroupId/myArtifactId/SimpleApp.java
I moved my class into the default package and it worked fine afterwards.

spark-submit to Cloudera cluster cannot find any dependent jars

I am able to do a spark-submit to my Cloudera cluster. The job dies after a few minutes with exceptions complaining that it cannot find various classes. These are classes that are in the Spark dependency path. I keep adding the jars one at a time with the --jars command-line argument, and the YARN log keeps dumping out the next jar it can't find.
What setting allows the spark/yarn job to find all the dependent jars?
I already set the "spark.home" attribute to the correct path - /opt/cloudera/parcels/CDH/lib/spark
I found it!
Remove
.set("spark.driver.host", "driver computer ip address")
from your driver code.
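For context, a driver-side setup that leaves the driver host (and master) for spark-submit and YARN to fill in might look roughly like this (the app name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}
// Do not hard-code spark.driver.host (or the master) here; spark-submit / YARN
// supply those when the job is launched on the cluster.
val conf = new SparkConf().setAppName("my-app")    // placeholder app name
val sc = new SparkContext(conf)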
