Automatically including jars to PySpark classpath - apache-spark

I'm trying to automatically include jars in my PySpark classpath. Right now I can type the following command and it works:
$ pyspark --jars /path/to/my.jar
I'd like that jar to be included by default, so that I can just type pyspark and also use it in IPython Notebook.
I've read that I can include the argument by setting PYSPARK_SUBMIT_ARGS in env:
export PYSPARK_SUBMIT_ARGS="--jars /path/to/my.jar"
Unfortunately the above doesn't work. I get the runtime error Failed to load class for data source.
Running Spark 1.3.1.
Edit
My workaround when using IPython Notebook is the following:
$ IPYTHON_OPTS="notebook" pyspark --jars /path/to/my.jar

You can add the jar files in the spark-defaults.conf file (located in the conf folder of your spark installation). If there is more than one entry in the jars list, use : as separator.
spark.driver.extraClassPath /path/to/my.jar
This property is documented in https://spark.apache.org/docs/1.3.1/configuration.html#runtime-environment

As far as I know, you have to add the jars to both the driver AND executor classpaths. So you need to edit conf/spark-defaults.conf and add both lines below.
spark.driver.extraClassPath /path/to/my.jar
spark.executor.extraClassPath /path/to/my.jar
When I went through this, I did not need any other parameters. I guess you will not need them either.
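A minimal sketch of generating those two entries when there are several jars, assuming the Linux-style : separator mentioned above. The helper name extra_classpath_lines and the second jar path are made up for illustration.

```python
def extra_classpath_lines(jars, sep=":"):
    """Return spark-defaults.conf lines for the driver and executor classpaths."""
    classpath = sep.join(jars)
    return [
        f"spark.driver.extraClassPath {classpath}",
        f"spark.executor.extraClassPath {classpath}",
    ]


# /path/to/other.jar is a hypothetical second entry, not from the question.
lines = extra_classpath_lines(["/path/to/my.jar", "/path/to/other.jar"])
print("\n".join(lines))
```

The printed lines can be appended to conf/spark-defaults.conf by hand.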

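Spark 2.0+ also has
spark.driver.extraLibraryPath
and spark.executor.extraLibraryPath
https://spark.apache.org/docs/2.4.3/configuration.html#runtime-environment
but according to that page they set the library path used when launching the JVM (for native libraries), not the classpath for jars.
ps. spark.driver.extraClassPath and spark.executor.extraClassPath are still there and remain the settings for jars.
The page does not mark them as deprecated, although it notes that the executor one exists primarily for backwards compatibility.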

Related

How to add jar files to $SPARK_HOME/jars correctly?

I have used this command and it works fine:
spark = SparkSession.builder.appName('Apptest')\
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.5').getOrCreate()
But I'd like to download the jar file and always start with:
spark = SparkSession.builder.appName('Apptest').getOrCreate()
How can I do it? I have tried:
Move to SPARK_HOME jar dir:
cd /de/spark-2.4.6-bin-hadoop2.7/jars
Download jar file
curl https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.3.5/mongo-spark-connector_2.11-2.3.5.jar --output mongo-spark-connector_2.11-2.3.5.jar
But Spark doesn't see it. I got the following error:
Py4JJavaError: An error occurred while calling o66.save.
: java.lang.NoClassDefFoundError: com/mongodb/ConnectionString
I know there is the ./spark-shell --jars option, but I am using Jupyter notebook. Is there some step missing?
Since you're using SparkSession in the Jupyter notebook, unfortunately you have to use .config('spark.jars.packages', '...') to add the jars you want when you create the spark object.
Alternatively, if you want the jar added by default when you launch the notebook, I would recommend creating a custom kernel, so that every time you create a new notebook you don't even need to create the spark object. If you're using Anaconda, you can check the docs: https://docs.anaconda.com/ae-notebooks/admin-guide/install/config/custom-pyspark-kernel/
What I was looking for is .config("spark.jars",".."):
spark = SparkSession.builder.appName('Test')\
.config("spark.jars", "/root/mongo-spark-connector_2.11-2.3.5.jar,/root/mongo-java-driver-3.12.5.jar") \
.getOrCreate()
Or:
import os
os.environ["PYSPARK_SUBMIT_ARGS"]="--jars /root/mongo-spark-connector_2.11-2.3.5.jar,/root/mongo-java-driver-3.12.5.jar pyspark-shell"
Also, it seems that just putting the jar files in $SPARK_HOME/jars works fine as well; in my case (and in the question's example) the dependency mongo-java-driver-3.12.5.jar was missing. After downloading all dependencies into $SPARK_HOME/jars, I was able to run with only:
spark = SparkSession.builder.appName('Test').getOrCreate()
I found the dependencies at: https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector_2.11/2.3.5
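A small sketch of the PYSPARK_SUBMIT_ARGS variant, using the jar paths from the answer above. The helper name build_submit_args is hypothetical. The variable has to be set before the first SparkSession is created, because PySpark reads it when it launches the JVM.

```python
import os


def build_submit_args(jars):
    """Build a PYSPARK_SUBMIT_ARGS value: comma-separated --jars, ending with pyspark-shell."""
    return "--jars " + ",".join(jars) + " pyspark-shell"


jars = [
    "/root/mongo-spark-connector_2.11-2.3.5.jar",
    "/root/mongo-java-driver-3.12.5.jar",
]

# Report jars that are not on disk; a missing dependency jar is what
# caused the NoClassDefFoundError in the question.
missing = [jar for jar in jars if not os.path.exists(jar)]
if missing:
    print("missing jars:", missing)

# Set this before the first SparkSession/SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(jars)
```

After this, SparkSession.builder.appName('Test').getOrCreate() should pick the jars up without any extra .config(...) call.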

Spark-shell does not import specified jar file

I am a complete beginner to all this stuff in general so pardon if I'm missing some totally obvious step. I installed spark 3.1.2 and cassandra 3.11.11 and I'm trying to connect both of them through this guide I found where I made a fat jar for execution. In the link I posted when they execute the spark-shell command with the jar file, there's a line which occurs at the start.
INFO SparkContext: Added JAR file:/home/chbatey/dev/tmp/spark-cassandra-connector/spark-cassandra-connector-java/target/scala-2.10/spark-cassandra-connector-java-assembly-1.2.0-SNAPSHOT.jar at http://192.168.0.34:51235/jars/spark-15/01/26 16:16:10 INFO SparkILoop: Created spark context..
I followed all of the steps properly but it doesn't show any line like that in my shell. To confirm that it hasn't been added I try the sample program on that website and it throws an error
java.lang.NoClassDefFoundError: com/datastax/spark/connector/util/Logging
What should I do? I'm using spark-cassandra-connector-3.1.0
You don't need to compile it yourself, just follow official documentation - use --packages to automatically download all dependencies:
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
The error occurs because the connector jar doesn't bundle its dependencies, so you would have to list every one of them (the Java driver, etc.) yourself. If you still want to use the --jars option, download the assembly version of the connector (link to jar), which contains all the necessary dependencies.
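To check whether a given jar actually contains the class named in a NoClassDefFoundError, a jar can be inspected as a zip archive. This is only a sketch: the function name jar_contains_class is made up, and it only says whether the class file is in that one jar, not whether Spark loaded the jar.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar (a zip archive) has an entry for the given class."""
    # Accept both com.example.Foo and com/example/Foo spellings.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, jar_contains_class("/path/to/connector.jar", "com/datastax/spark/connector/util/Logging") (with a placeholder path) returning False would suggest the class lives in a dependency that the plain connector jar does not include.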

Is there a way to use PySpark with Hadoop 2.8+?

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.
PySpark versions seem to be aligned with Spark versions.
Here I use PySpark 2.4.5 which seems to wrap a Spark 2.4.5.
When submitting my PySpark job using spark-submit --master local[4] ..., with the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql
With the following java exceptions:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
Or:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I suppose that the Hadoop version used by the PySpark job is not aligned with the one I pass via the spark.jars.packages option.
But I have no idea how to make it work. :)
The default Spark distro ships with its own Hadoop libraries, and Spark loads those (its own) libraries first. So you should either set --conf spark.driver.userClassPathFirst=true (and, on a cluster, also --conf spark.executor.userClassPathFirst=true), or download a Spark distro without Hadoop. In the latter case you will probably have to put your Hadoop distro's jars into the Spark distro's jars directory.
Ok, I found a solution:
1 - Install Hadoop in the expected version (2.8.5 for me)
2 - Install a Hadoop Free version of Spark (2.4.4 for me)
3 - Set the SPARK_DIST_CLASSPATH environment variable, to make Spark use the custom version of Hadoop.
(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
4 - Add the PySpark directories to PYTHONPATH environment variable, like the following:
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
(Note that the py4j version may differ)
That's it.
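Because the py4j version in step 4 varies between Spark releases, the PYTHONPATH entries can be located by pattern instead of hard-coding the file name. A minimal sketch, assuming the usual python/lib layout under SPARK_HOME; the function name pyspark_python_path is hypothetical.

```python
import glob
import os


def pyspark_python_path(spark_home):
    """Return the PYTHONPATH entries from step 4, finding the py4j zip by pattern."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not py4j_zips:
        raise FileNotFoundError(f"no py4j-*-src.zip under {lib_dir}")
    return [
        py4j_zips[-1],  # whichever py4j version ships with this Spark
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "build"),
    ]
```

Joining the result with os.pathsep gives the value to prepend to PYTHONPATH, for example os.pathsep.join(pyspark_python_path(os.environ["SPARK_HOME"])).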

Connecting to Teradata using Spark JDBC

I am trying to extract data from Teradata using Spark JDBC. I have created a "lib" directory in the main parent directory, placed the external Teradata jars there, and run sbt package. In addition, I am also passing the "--jars" option to my spark-shell command to provide the jar. However, when I run spark-shell, it does not seem to find the class:
Exception in thread "main" java.lang.ClassNotFoundException: com.teradata.hadoop.tool.TeradataImportTool
However, when I do "jar tvf" on the jar file, I see the class. Somehow the Spark utility is unable to find the jar. Is there anything else I need to do so Spark could find it? Please help
The class com.teradata.hadoop.tool.TeradataImportTool is in teradata-hadoop-connector.jar.
You can try passing it when submitting the job, as in the example below:
--conf spark.driver.extraClassPath=<complete path of teradata-hadoop-connector.jar>
--conf spark.executor.extraClassPath=<complete path of teradata-hadoop-connector.jar>
OR
add the jar to both the driver and executor classpaths by editing conf/spark-defaults.conf and adding both lines below:
spark.driver.extraClassPath <complete path of teradata-hadoop-connector.jar>
spark.executor.extraClassPath <complete path of teradata-hadoop-connector.jar>
NOTE: As an alternative approach that avoids this kind of issue, you can build an uber jar (also known as a fat jar), i.e. a jar with dependencies.
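The command-line form is easy to get wrong, since --conf expects key=value, --jars takes a comma-separated list, and the classpath settings take a colon-separated one (on Linux). A sketch that assembles the arguments; the function name spark_classpath_args and the /path/to directory are placeholders.

```python
import shlex


def spark_classpath_args(jars):
    """Return spark-shell/spark-submit arguments that pass the jars and set both classpaths."""
    classpath = ":".join(jars)  # classpath entries are colon-separated on Linux
    return [
        "--jars", ",".join(jars),  # --jars takes a comma-separated list
        "--conf", f"spark.driver.extraClassPath={classpath}",
        "--conf", f"spark.executor.extraClassPath={classpath}",
    ]


# /path/to is a placeholder directory, not a real location.
args = spark_classpath_args(["/path/to/teradata-hadoop-connector.jar"])
print(shlex.join(["spark-shell"] + args))
```

Note that spark.executor.extraClassPath refers to paths on the worker machines, so the jar has to exist there at the same location.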

Zeppelin: How to add python files in PYTHONPATH

I am running zeppelin with Spark on yarn.
The --py-files option (via SPARK_SUBMIT_OPTIONS) does not work in Zeppelin. Is there any alternative to --py-files in Zeppelin?
NOTE: I can upload files using option: --files but then it does not add those files in PYTHONPATH. Hence I need an alternative to --py-files in zeppelin.
I'm not familiar with Zeppelin, but you can achieve the equivalent of --py-files by using the addPyFile method on the SparkContext (see https://spark.apache.org/docs/1.6.1/api/python/pyspark.html). HTH
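A minimal sketch of that suggestion, assuming a SparkContext is already available as sc in the notebook paragraph. The helper name add_py_files and the example paths are hypothetical; only sc.addPyFile itself is the PySpark call.

```python
def add_py_files(sc, paths):
    """Call sc.addPyFile for each .py, .zip or .egg path so it can be imported by later tasks."""
    for path in paths:
        sc.addPyFile(path)


# With `sc` being the notebook's SparkContext, usage might look like:
# add_py_files(sc, ["/path/to/helpers.py", "/path/to/deps.zip"])  # hypothetical paths
```

Modules added this way can then be imported inside the functions that run on the executors.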

Resources