List all additional jars loaded in pyspark - apache-spark

I want to see the jars my spark context is using.
I found the code in Scala:
$ spark-shell --jars --master=spark://datasci:7077 --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar --packages elsevierlabs-os:spark-xml-utils:1.6.0
scala> spark.sparkContext.listJars.foreach(println)
spark://datasci:42661/jars/net.sf.saxon_Saxon-HE-9.6.0-7.jar
spark://datasci:42661/jars/elsevierlabs-os_spark-xml-utils-1.6.0.jar
spark://datasci:42661/jars/org.apache.commons_commons-lang3-3.4.jar
spark://datasci:42661/jars/commons-logging_commons-logging-1.2.jar
spark://datasci:42661/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar
spark://datasci:42661/jars/commons-io_commons-io-2.4.jar
Source: List All Additional Jars Loaded in Spark
But I could not find how to do it in PySpark.
Any suggestions?
Thanks

sparkContext._jsc.sc().listJars()
_jsc is the java spark context

I really got the extra jars with this command:
print(spark.sparkContext._jsc.sc().listJars())

Related

spark 3.x on HDP 3.1 in headless mode with hive - hive tables not found

How can I configure Spark 3.x on HDP 3.1 using headless (https://spark.apache.org/docs/latest/hadoop-provided.html) version of spark to interact with hive?
First, I have downloaded and unzipped the headless spark 3.x:
cd ~/development/software/spark-3.0.0-bin-without-hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/hdp/current/spark2-client/conf classpath)
ls /usr/hdp # note version ad add it below and replace 3.1.x.x-xxx with it
./bin/spark-shell --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
spark.sql("show databases").show
// only showing default namespace, existing hive tables are missing
+---------+
|namespace|
+---------+
| default|
+---------+
spark.conf.get("spark.sql.catalogImplementation")
res2: String = in-memory # I want to see hive here - how? How to add hive jars onto the classpath?
NOTE
This is an updated version of How can I run spark in headless mode in my custom version on HDP? for Spark 3.x ond HDP 3.1 and custom spark does not find hive databases when running on yarn.
Furthermore: I am aware of the problems of ACID hive tables in spark. For now, I simply want to be able to see the existing databases
edit
We must get the hive jars onto the class path. Trying as follows:
export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}"
And now using spark-sql:
./bin/spark-sql --master yarn --queue myqueue--conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
fails with:
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
I.e. the line: export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}", had no effect (same issue if not set).
As noted above and custom spark does not find hive databases when running on yarn the Hive JARs are needed. They are not supplied in the headless version.
I was unable to retrofit these.
Solution: instead of worrying: simply use the spark build with Hadoop 3.2 (on HDP 3.1)

Error: while writing pyspark dataframe into Habse

I am trying to write the pyspark dataframe into Hbase. Facing below error.
Spark and Hbase version on my cluster are:
Spark Version: 2.4.0
Hbase Version: 1.4.8
Spark Submit
spark-submit --jars /tmp/hbase-spark-1.0.0.jar --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml to_hbase.py
error:
Any help would be much appreciated!
It is a known problem - using spark-habse-connector (shc) with spark2.4.
There is a fix #dhananjay_patka did.
Check: SHC With spark 2.4
and his fix

Addingdependency jars to spark classpath in a standalone mode

I m trying to execute my local java program in spark which has dependencies, i tried executing spark submit option as below :
spark-submit --class com.cerner.doc.DocumentExtractor /Users/sp054800/Downloads/Docs_lib_jar/Docs_RestAPI.jar
after setting the
spark.driver.extraClassPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.driver.extraLibraryPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.executor.extraClassPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.executor.extraLibraryPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
in spark-defaults.conf, but still no help could anyone help me to fix this of how do i need to include the jars in spark. I m using spark2.2.0

Error starting pyspark with options (Without Spack packages)

can anyone tell me why I'm getting the errors below ? According to the
README for the pyspark-cassandra connector, what I am trying below should work (without Spark packages): https://github.com/TargetHolding/pyspark-cassandra
$ pyspark_jar="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/scala-2.10/pyspark-cassandra-assembly-0.2.2.jar"
$ pyspark_egg="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/pyspark_cassandra-0.2.2-py2.7.egg"
$ pyspark --jars $pyspark_jar --py_files $pyspark_egg --conf spark.cassandra.connection.host=localhost
This results in:
Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:222)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:239)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:113)
at org.apache.spark.launcher.Main.main(Main.java:74)
Figured out the problem. I needed to use
--py-files
instead of
--py_files

How to specify multiple dependencies using --packages for spark-submit?

I have the following as the command line to start a spark streaming job.
spark-submit --class com.biz.test \
--packages \
org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
org.apache.hbase:hbase-common:1.0.0 \
org.apache.hbase:hbase-client:1.0.0 \
org.apache.hbase:hbase-server:1.0.0 \
org.json4s:json4s-jackson:3.2.11 \
./test-spark_2.10-1.0.8.jar \
>spark_log 2>&1 &
The job fails to start with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried removing the formatting and returning to a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, added _2.10 to the end of the artifactId, etc.
According to the docs (spark-submit --help):
The format for the coordinates should be groupId:artifactId:version.
So what I have should be valid and should reference this package.
If it helps, I'm running Cloudera 5.4.4.
What am I doing wrong? How can I reference the hbase packages correctly?
A list of packages should be separated using commas without whitespaces (breaking lines should work just fine) for example
--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0
I found it worthy to use SparkSession in spark version 3.0.0 for mysql and postgres
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mysql-postgres').config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16').getOrCreate()
#Mohammad thanks for this input. This worked for me too. I had to load the Kafka and msql packages in a single sparksession. I did something like this:
spark = (SparkSession .builder ... .appName('myapp') # Add kafka and msql package .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,mysql:mysql-connector-java:8.0.26") .getOrCreate())

Resources