spark-submit dependency conflict - apache-spark

I'm trying to submit a jar to Spark, but my jar contains dependencies that conflict with Spark's built-in jars (snakeyaml and others).
Is there a way to tell Spark to prefer whatever dependencies my project has over the jars in its /jars directory?
UPDATE
When I run spark-submit, I get the following exception:
Caused by: java.lang.NoSuchMethodError: javax.validation.BootstrapConfiguration.getClockProviderClassName()Ljava/lang/String;
at org.hibernate.validator.internal.xml.ValidationBootstrapParameters.<init>(ValidationBootstrapParameters.java:63)
at org.hibernate.validator.internal.engine.ConfigurationImpl.parseValidationXml(ConfigurationImpl.java:540)
at org.hibernate.validator.internal.engine.ConfigurationImpl.buildValidatorFactory(ConfigurationImpl.java:337)
at javax.validation.Validation.buildDefaultValidatorFactory(Validation.java:110)
at org.hibernate.cfg.beanvalidation.TypeSafeActivator.getValidatorFactory(TypeSafeActivator.java:501)
at org.hibernate.cfg.beanvalidation.TypeSafeActivator.activate(TypeSafeActivator.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.hibernate.cfg.beanvalidation.BeanValidationIntegrator.integrate(BeanValidationIntegrator.java:132)
... 41 more
This is caused by Spark shipping an older version of validation-api (validation-api-1.1.0.Final.jar).
My project depends on the newer version, and it does get bundled with my jar (javax.validation:validation-api:jar:2.0.1.Final:compile)
I submit using this command:
/spark/bin/spark-submit --conf spark.executor.userClassPathFirst=true --conf spark.driver.userClassPathFirst=true
but I still get the same exception.

If you are building your jar using SBT, you need to exclude the classes that are already on the cluster, for example like below:
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
You do that by adding "provided", which means these classes are already provided in the environment where you run the job.
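For the conflict in the question, a minimal build.sbt along these lines would keep Spark out of the fat jar while bundling the newer validation-api (the spark-sql line is illustrative; only the validation-api version comes from the question):
// Spark is provided by the cluster, so it is excluded from the assembly
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
  // application dependencies stay in compile scope and get bundled
  "javax.validation" % "validation-api" % "2.0.1.Final"
)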

Not sure if you are using SBT, but I used this in build.sbt via sbt-assembly, as I had all sorts of dependency conflicts at one stage. See below; maybe this will help.
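The exact fragment is not shown above, but a sketch of the kind of sbt-assembly setup that resolves this sort of conflict is to shade (relocate) the conflicting packages so they can never collide with Spark's copy; the shaded package name here is just an example:
// build.sbt, with the sbt-assembly plugin enabled in project/plugins.sbt
assembly / assemblyShadeRules := Seq(
  // rewrite javax.validation references in all bundled jars to a private package
  ShadeRule.rename("javax.validation.**" -> "myshaded.validation.@1").inAll
)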

This is controlled by setting the following confs to true:
spark.driver.userClassPathFirst
spark.executor.userClassPathFirst

I had issues with 2 jars, and this is what I ended up doing: I copied the required jars to a directory and used the extraClassPath option.
spark-submit --conf spark.driver.extraClassPath="C:\sparkjars\validation-api-2.0.1.Final.jar;C:\sparkjars\gson-2.8.6.jar" myspringbootapp.jar
From the documentation, spark.driver.extraClassPath is "Extra classpath entries to prepend to the classpath of the driver." (On Linux the entries are separated with ':' instead of the ';' used on Windows above.)

Related

Spark job can't connect to Cassandra when run from a jar

I have a Spark job that writes data to Cassandra (Cassandra is on GCP). When I run this from IntelliJ IDEA (my IDE) it works perfectly fine; the data is sent and written to Cassandra. However, it fails when I package my project into a fat jar and run it.
Here is an example of how I run it.
spark-submit --class com.testing.Job --master local out/artifacts/SparkJob_jar/SparkJob.jar 1 0
However, this fails for me and gives me the following errors
Caused by: java.io.IOException: Failed to open native connection to Cassandra at {X.X.X:9042} :: 'com.datastax.oss.driver.api.core.config.ProgrammaticDriverConfigLoaderBuilder com.datastax.oss.driver.api.core.config.DriverConfigLoader.programmaticBuilder()'
Caused by: java.lang.NoSuchMethodError: 'com.datastax.oss.driver.api.core.config.ProgrammaticDriverConfigLoaderBuilder com.datastax.oss.driver.api.core.config.DriverConfigLoader.programmaticBuilder()'
My artifacts file does include the Spark Cassandra connector jars:
spark-cassandra-connector-driver_2.12-3.0.0-beta.jar
spark-cassandra-connector_2.12-3.0.0-beta.jar
I'm wondering why this is happening and how I can fix it.
The problem is that besides those two jars you need more: the full Java driver and its dependencies. You have the following possibilities to fix that:
You need to make sure that these artifacts and their transitive dependencies are packaged into the resulting jar (a so-called "fat jar" or "assembly") using Maven, SBT, or anything else (see the sketch after this answer)
you can specify the Maven coordinates with --packages, like this: --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta
you can download the spark-cassandra-connector-assembly artifact to the node from which you're doing spark-submit, and then use that file name with --jars
See the documentation for Spark Cassandra Connector for more details.
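For the first option, a build.sbt sketch could look like the following; the Spark version and the provided scope are assumptions, while the connector coordinates come from the question. The connector pulls in the full Java driver transitively, so an sbt-assembly build of this contains everything the fat jar was missing:
libraryDependencies ++= Seq(
  // Spark itself comes from the cluster / spark-submit
  "org.apache.spark" %% "spark-core" % "3.0.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided",
  // bundled into the assembly together with the DataStax Java driver it depends on
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0-beta"
)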

How to use different Spark version (Spark 2.4) on YARN cluster deployed with Spark 2.1?

I have a Hortonworks yarn cluster with Spark 2.1.
However, I want to run my application with Spark 2.3+ (because an essential third-party ML library in use needs it).
Do we have to use spark-submit from the Spark 2.1 installation, or do we have to submit the job to YARN using Java or Scala with a fat jar? Is this even possible? What about the Hadoop libraries?
On a Hortonworks cluster, running a custom Spark version in YARN client/cluster mode needs the following steps:
Download the Spark prebuilt package with the appropriate Hadoop version
Extract and unpack it into a spark folder, e.g. /home/centos/spark/spark-2.3.1-bin-hadoop2.7/
Copy the jersey-bundle 1.19.1 jar into the spark jars folder
Create a zip file containing all the jars in the spark jars folder, e.g. spark-jars.zip
Put this spark-jars.zip file in a world-accessible HDFS location, e.g. hdfs dfs -put spark-jars.zip /user/centos/data/spark/
Get the HDP version (hdp-select status hadoop-client); example output: hadoop-client - 3.0.1.0-187
Use that HDP version in the export commands below:
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/3.0.1.0-187/hadoop/conf}
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/3.0.1.0-187/hadoop}
export SPARK_HOME=/home/centos/spark/spark-2.3.1-bin-hadoop2.7/
Edit the spark-defaults.conf file in the spark_home/conf directory and add the following entries:
spark.driver.extraJavaOptions -Dhdp.version=3.0.1.0-187
spark.yarn.am.extraJavaOptions -Dhdp.version=3.0.1.0-187
Create a java-opts file in the spark_home/conf directory and add the entry below, using the above-mentioned HDP version:
-Dhdp.version=3.0.1.0-187
export LD_LIBRARY_PATH=/usr/hdp/3.0.1.0-187/hadoop/lib/native:/usr/hdp/3.0.1.0-187/hadoop/lib/native/Linux-amd64-64
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs:///user/centos/data/spark/spark-jars.zip
I assume you use sbt as the build tool in your project. The project itself could use Java or Scala. I also think that the answer would in general be similar if you used Gradle or Maven, just with different plugins. The idea is the same.
You have to use an assembly plugin (e.g. sbt-assembly) that is going to bundle all non-Provided dependencies together, including Apache Spark, in order to create a so-called fat jar or uber-jar.
If the custom Apache Spark version is part of the application jar, that version is going to be used regardless of which spark-submit you use for deployment. The trick is to make the classloader load the jars and classes of your choice rather than spark-submit's (and hence whatever is used on the cluster).
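As a sketch of that approach, assuming sbt-assembly and illustrative version numbers, the key difference from a normal Spark build is simply not marking Spark as provided, so it ends up inside the uber-jar:
// deliberately NOT "provided": Spark 2.3.x is bundled into the assembly
// and shadows the cluster's Spark 2.1 jars at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql" % "2.3.1"
)
Expect to need an assemblyMergeStrategy for the duplicate files that bundling all of Spark brings in.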

issue with ignite jars for spark

I am trying to use ignite 2.4 with Spark 2.1.
I add the paths of the following libs from the Ignite binary distribution when starting spark-shell:
--conf spark.diver.extraClassPath=/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/*:/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/optional/*:/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/ignite-indexing/*:/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/ignite-spring/* --conf spark.executor.extraClassPath=/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/*:/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/optional/*:/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/ignite-indexing/*:/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/ignite-spring/*
But I cannot import any of the libs, such as
import org.apache.ignite.configuration._
error: object ignite is not a member of package org.apache
How should I resolve this?
/home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/optional/*: Ignite has many subfolders under optional, but Java does not include subfolders when expanding a classpath wildcard. To pick up the Ignite Spark integration jars you should add /home/sshuser/apache-ignite-fabric-2.4.0-bin/libs/optional/ignite-spark/* to the classpath.
Please read the official documentation: https://apacheignite-fs.readme.io/v2.4/docs/installation-deployment
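Once the ignite-spark jars are on the driver and executor classpaths, the imports should resolve in spark-shell; as a quick check (the configuration file path is just an example):
import org.apache.ignite.configuration._
import org.apache.ignite.spark.IgniteContext
// sc is the SparkContext provided by spark-shell; the XML is the Ignite node configuration
val igniteContext = new IgniteContext(sc, "/home/sshuser/apache-ignite-fabric-2.4.0-bin/config/default-config.xml")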

Apache Beam word count example with Spark runner fails with "Unknown 'runner' specified 'SparkRunner'"

I am trying to spark-submit the Apache Beam word-count example with the command below:
spark-submit --class org.apache.beam.examples.WordCount word-count-beam-0.1.jar --inputFile=pom.xml --output=counts --runner=SparkRunner
I get the exception below:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown 'runner' specified 'SparkRunner', supported pipeline runners [DirectRunner]
Your pom.xml needs to include a dependency on the Spark runner. The documentation on using the Spark runner includes more details about what is necessary.
Looks like you're not building an Uber-jar with the necessary Spark dependencies.
Re-run your Maven package as follows:
mvn package -Pspark-runner
This will build a jar in target/ containing the word-count classes as well as all of the necessary Spark runner dependencies, named something like:
word-count-beam-bundled-0.1.jar
Then use that jar in the spark-submit command.

Argument list too long when running Spark on YARN

I am trying to migrate our application to Spark running on YARN. I use a command line like spark-submit --master yarn --deploy-mode cluster --jars ${my_jars}...
But YARN throws exceptions with the following log:
Container id: container_1462875359170_0171_01_000002
Exit code: 1
Exception message: .../launch_container.sh: line 4145: /bin/bash: Argument list too long
I think the reason may be that we specify too many jars (684 jars, comma-separated) with the --jars ${my_jars} option. My question is: what is a graceful way to specify all our jars? Or how can we avoid this YARN error?
Check if you can use spark.driver.extraClassPath and spark.executor.extraClassPath instead (see the Spark documentation):
spark.driver.extraClassPath /fullpath/firs.jar:/fullpath/second.jar
spark.executor.extraClassPath /fullpath/firs.jar:/fullpath/second.jar
Just found the thread spark-submit-add-multiple-jars-in-classpath
I'd try these two things:
Build a fat jar for the spark-submit application, or
Build a thin jar with Maven and install the missing jars into the Maven repo so that they are available to load at runtime on the cluster.
Try sbt-assembly, which packages all your classes and dependency classes into an uber jar.
It is very easy and comfortable to use, but you have to take care of two things:
version conflict
the jar would be a little bit large
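A minimal setup, as a sketch (the plugin version and jar names are illustrative): add the plugin and a merge strategy for the duplicate files that almost always appear when hundreds of dependencies are merged, which also covers the version-conflict caveat above.
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
// build.sbt: resolve duplicate entries deterministically when jars overlap
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
Then sbt assembly produces a single jar (e.g. target/scala-2.11/myapp-assembly-0.1.jar) that can be passed to spark-submit without any --jars list.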
