Adding dependency jars to spark classpath in standalone mode - apache-spark

I'm trying to execute my local Java program, which has dependencies, in Spark. I tried the spark-submit command below:
spark-submit --class com.cerner.doc.DocumentExtractor /Users/sp054800/Downloads/Docs_lib_jar/Docs_RestAPI.jar
after setting the
spark.driver.extraClassPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.driver.extraLibraryPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.executor.extraClassPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.executor.extraLibraryPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
in spark-defaults.conf, but it still did not help. Could anyone help me fix this, i.e. how do I need to include the jars in Spark? I am using Spark 2.2.0.
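A minimal sketch of one common alternative for a standalone setup: pass the dependency jars straight to spark-submit with --jars, which takes a comma-separated list. The shell expansion below assumes all jars sit under the lib directory from the question and that no path contains spaces:
spark-submit \
  --class com.cerner.doc.DocumentExtractor \
  --jars "$(echo /Users/sp054800/Downloads/Docs_lib_jar/lib/*.jar | tr ' ' ',')" \
  /Users/sp054800/Downloads/Docs_lib_jar/Docs_RestAPI.jar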

Related

spark 3.x on HDP 3.1 in headless mode with hive - hive tables not found

How can I configure Spark 3.x on HDP 3.1 to interact with Hive, using the headless version of Spark (https://spark.apache.org/docs/latest/hadoop-provided.html)?
First, I have downloaded and unzipped the headless spark 3.x:
cd ~/development/software/spark-3.0.0-bin-without-hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/hdp/current/spark2-client/conf classpath)
ls /usr/hdp # note the version and replace 3.1.x.x-xxx below with it
./bin/spark-shell --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
spark.sql("show databases").show
// only showing default namespace, existing hive tables are missing
+---------+
|namespace|
+---------+
| default|
+---------+
spark.conf.get("spark.sql.catalogImplementation")
res2: String = in-memory # I want to see hive here - how? How to add hive jars onto the classpath?
NOTE
This is an updated version of "How can I run spark in headless mode in my custom version on HDP?" (now for Spark 3.x and HDP 3.1) and of "custom spark does not find hive databases when running on yarn".
Furthermore: I am aware of the problems with ACID Hive tables in Spark. For now, I simply want to be able to see the existing databases.
edit
We must get the Hive jars onto the classpath. Trying as follows:
export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}"
And now using spark-sql:
./bin/spark-sql --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
fails with:
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
I.e. the line export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}" had no effect (same issue when it is not set).
As noted above and in "custom spark does not find hive databases when running on yarn", the Hive JARs are needed. They are not supplied in the headless version.
I was unable to retrofit these.
Solution: instead of worrying about this, simply use the Spark build that comes with Hadoop 3.2 (on HDP 3.1).
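A sketch of what the working launch might look like with the Hadoop-bundled package, assuming it unpacks to spark-3.0.0-bin-hadoop3.2; all other paths and the hdp.version placeholder are the ones from the question, and SPARK_DIST_CLASSPATH should no longer be required:
cd ~/development/software/spark-3.0.0-bin-hadoop3.2
export HADOOP_CONF_DIR=/etc/hadoop/conf/
./bin/spark-shell --master yarn --queue myqueue \
  --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' \
  --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' \
  --conf spark.hadoop.metastore.catalog.default=hive \
  --files /usr/hdp/current/hive-client/conf/hive-site.xml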

List all additional jars loaded in pyspark

I want to see the jars my spark context is using.
I found the code in Scala:
$ spark-shell --master=spark://datasci:7077 --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar --packages elsevierlabs-os:spark-xml-utils:1.6.0
scala> spark.sparkContext.listJars.foreach(println)
spark://datasci:42661/jars/net.sf.saxon_Saxon-HE-9.6.0-7.jar
spark://datasci:42661/jars/elsevierlabs-os_spark-xml-utils-1.6.0.jar
spark://datasci:42661/jars/org.apache.commons_commons-lang3-3.4.jar
spark://datasci:42661/jars/commons-logging_commons-logging-1.2.jar
spark://datasci:42661/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar
spark://datasci:42661/jars/commons-io_commons-io-2.4.jar
Source: List All Additional Jars Loaded in Spark
But I could not find how to do it in PySpark.
Any suggestions?
Thanks
sparkContext._jsc.sc().listJars()
_jsc is the Java SparkContext.
I really got the extra jars with this command:
print(spark.sparkContext._jsc.sc().listJars())
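For completeness, a sketch of how the equivalent PySpark session could be started and inspected; the jar path and package coordinates are the ones from the Scala example above:
pyspark --master spark://datasci:7077 \
  --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar \
  --packages elsevierlabs-os:spark-xml-utils:1.6.0
>>> print(spark.sparkContext._jsc.sc().listJars())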

Dependency is not added to Spark + Zeppelin

I can't add a custom dependency to the Spark classpath from Zeppelin.
Environment:
AWS EMR: Zeppelin 0.8.0, Spark 2.4.0
Extra configs for the Spark interpreter:
spark.jars.ivySettings /tmp/ivy-settings.xml
spark.jars.packages my-group-name:artifact_2.11:version
The files from my-group-name appeared in
spark.yarn.dist.jars
spark.yarn.secondary.jars
But they are not accessible from the Zeppelin notebook (checked with import my.lab._).
However, when I run the same configs with spark-shell, it works both on my local machine and over SSH on the EMR cluster, and the imports are available from spark-shell.
sun.java.command for Zeppelin:
org.apache.spark.deploy.SparkSubmit --master yarn-client ... --conf spark.jars.packages=my-group-name:artifact_2.11:version ... --conf spark.jars.ivySettings=/tmp/ivy-settings.xml ... --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer /usr/lib/zeppelin/interpreter/spark/spark-interpreter-0.8.0.jar <IP ADDRESS> 34717 :
The corresponding spark-shell invocation on EMR:
spark-shell --master yarn-client --conf spark.jars.ivySettings="/tmp/ivy-settings.xml" --conf spark.jars.packages="my-group-name:artifact_2.11:version"
Any advice on where to look for the error?
You can try to add your jar directly to Zeppelin, in Interpreter settings.
http://zeppelin.apache.org/docs/0.8.0/usage/interpreter/dependency_management.html
Or add the jar to Spark's jars directory (in my case it's /usr/hdp/current/spark2/jars/).
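Another option is Zeppelin's dynamic dependency loading, sketched below with the coordinates from the question and assuming the artifact is resolvable from your configured repositories; the %spark.dep paragraph has to run before the Spark interpreter starts, so restart the interpreter first (see the dependency management docs linked above):
%spark.dep
z.load("my-group-name:artifact_2.11:version")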

Dependency is not distributed to Spark cluster

I'm trying to execute a Spark job on a Mesos cluster that depends on the spark-cassandra-connector library, but it keeps failing with
Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/package$
As I understand from the Spark documentation:
JARs and files are copied to the working directory for each SparkContext on the executor nodes.
...
Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages.
But it seems that only the task jar, pucker-assembly-1.0.jar, is distributed.
I'm running Spark 1.6.1 with Scala 2.10.6.
And here's the spark-submit command I'm executing:
spark-submit --deploy-mode cluster \
  --master mesos://localhost:57811 \
  --conf spark.ssl.noCertVerification=true \
  --packages datastax:spark-cassandra-connector:1.5.1-s_2.10 \
  --conf spark.cassandra.connection.host=10.0.1.83,10.0.1.86,10.0.1.85 \
  --driver-cores 3 \
  --driver-memory 4000M \
  --class SimpleApp \
  https://dripit-spark.s3.amazonaws.com/pucker-assembly-1.0.jar \
  s3n://logs/E1SR85P3DEM3LU.2016-05-05-11.ceaeb015.gz
So why isn't spark-cassandra-connector distributed to all my Spark executors?
You should use the correct Maven coordinate syntax:
--packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0
See
https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10
http://spark.apache.org/docs/latest/submitting-applications.html
http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
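For reference, a sketch of the original command with the corrected coordinate substituted in (everything else unchanged from the question):
spark-submit --deploy-mode cluster \
  --master mesos://localhost:57811 \
  --conf spark.ssl.noCertVerification=true \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 \
  --conf spark.cassandra.connection.host=10.0.1.83,10.0.1.86,10.0.1.85 \
  --driver-cores 3 \
  --driver-memory 4000M \
  --class SimpleApp \
  https://dripit-spark.s3.amazonaws.com/pucker-assembly-1.0.jar \
  s3n://logs/E1SR85P3DEM3LU.2016-05-05-11.ceaeb015.gz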

Add extra classpath to executors in Spark client mode

I'm using Spark 1.5.1 with the standalone cluster manager. Spark's default spark-assembly-1.5.1-hadoop2.6.0.jar includes Avro 1.7.7. I want to use my custom Avro library for all my Spark jobs; let's call it Avro 1.7.8. This works perfectly in dev mode (master=local[*]). However, when I submit my app to the cluster in client mode, the executors still use the Avro 1.7.7 library. To check which jar the executor actually loads, I resolve the location of the GenericData class:
URL url = getClass().getClassLoader().getResource(GenericData.class.getName().replace('.','/')+".class");
When I print this, my executor's log shows:
/opt/spark/lib/spark-assembly-1.5.1-hadoop2.6.0.jar/org/apache/avro/generic/GenericData.class
Here is a part of my spark-env.sh on the worker node:
export SPARK_WORKER_OPTS="-Dspark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar -Dspark.executor.userClassPathFirst=true"
Here is my worker process on the worker node (ps aux | grep worker):
spark 955 1.8 1.9 4161448 243600 ? Sl 13:29 0:09 /usr/java/jdk1.7.0_79/jre/bin/java -cp /home/ansible/avro-1.7.8.jar:/etc/spark-worker/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Dspark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar -Dspark.executor.userClassPathFirst=true -Xms512m -Xmx512m -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://spark-a-01:7077
Obviously, I put this jar (/home/ansible/avro-1.7.8.jar) on all my worker nodes.
Does anyone know how to force the executor to use my jar instead of the Spark assembly's one?
Try using the --packages option to spark-submit:
spark-submit --packages org.apache.avro:avro:1.7.8 ....
Something like that. If you're not using spark-submit, use it -- this is exactly what it is for.
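If pulling Avro from a repository is not an option, a hedged alternative sketch is to hand the local jar and the user-classpath-first settings to spark-submit directly instead of via spark-env.sh; both settings exist as regular Spark configs:
spark-submit \
  --jars /home/ansible/avro-1.7.8.jar \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.driver.userClassPathFirst=true \
  ...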
