Zeppelin: How to add python files in PYTHONPATH - apache-spark

I am running zeppelin with Spark on yarn.
Option --py-files(SPARK_SUBMIT_OPTIONS) does not work in zeppelin. Is there any alternative to --py-files in zeppelin.
NOTE: I can upload files using option: --files but then it does not add those files in PYTHONPATH. Hence I need an alternative to --py-files in zeppelin.

I'm not familiar with Zeppelin, but you can achieve the equivalent of --py-files by using the addPyFile method on the context (See https://spark.apache.org/docs/1.6.1/api/python/pyspark.html). HTH

Related

Is there a way to use PySpark with Hadoop 2.8+?

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.
PySpark versions seem to be aligned with Spark versions.
Here I use PySpark 2.4.5 which seems to wrap a Spark 2.4.5.
When submitting my PySpark Job, using spark-submit --local[4] ..., with the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql
With the following java exceptions:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
Or:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init (Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I suppose that the Pyspark Job Hadoop version is unaligned with the one I pass to the spark-submit option spark.jars.packages.
But I have not any idea of how I could make it work? :)
Default spark disto has hadoop libraries included. Spark use system (its own) libraries first. So you should either set --conf spark.driver.userClassPathFirst=true and for cluster add --conf spark.executor.userClassPathFirst=true or download spark distro without hadoop. Probably you will have to put your hadoop distro into spark disto jars directory.
Ok, I found a solution:
1 - Install Hadoop in the expected version (2.8.5 for me)
2 - Install a Hadoop Free version of Spark (2.4.4 for me)
3 - Set SPARK_DIST_CLASSPATH environment variable, to make Spark uses the custom version of Hadoop.
(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
4 - Add the PySpark directories to PYTHONPATH environment variable, like the following:
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
(Note that the py4j version my differs)
That's it.

how to add third party library to spark running on local machine

i am listening to eventhub stream and have seen article to attach library to cluster(databricks) and my code runs file.
For debugging i am running the code on local machine/cluster, but it fails for missing library. How can i add library when running on local machine.
i tried sparkcontext.addfile(fullpathtojar), but still same error.
You can use spark-submit --packages
Example: spark-submit --packages org.postgresql:postgresql:42.1.1
You would need to find the package that you are using and check the compatibility with spark.
With a single jar file you'd use spark-submit --jars instead.
i used spark-submit --packages {package} and it works.

Error: Unrecognized option: --packages

I'm porting an existing script from BigInsights to Spark on Bluemix. I'm trying to run the following against Spark on Bluemix:
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
--master https://x.x.x.x:8443 --jars ./truststore.jar \
--packages org.elasticsearch:elasticsearch-spark_2.10:2.3.0 \
./export_to_elasticsearch.py ...
However, I get the following error:
Error: Unrecognized option: --packages
How can I pass the --packages parameter?
Bluemix uses a customized Spark version, with a customized spark-submit.sh script that only supports a subset of the original script parameters. You can see all the configuration properties and parameters you can use on its documentation.
Additionally, you can download the Bluemix version of the script from this link, and there you can see that there is no argument --packages.
Therefore, the problem with your approach is that the Bluemix version of spark-submit does not accept the --packages parameter, probably due to security reasons. However, alternatively, you can download the jar for the package you want (and maybe a fat jar for the dependencies) and upload them using the --jars parameter. Note: To avoid the necessity of uploading the jar files each time you call spark-submit, you can pre-upload them using curl. The details of this procedure can be found on this link.
Adding to Daniel's post, while using the method to pre-upload your package, you might want to upload your package to "${cluster_master_url}/tenant/data/libs", since Spark service sets these four spark properties "spark.driver.extraClassPath", "spark.driver.extraLibraryPath", "spark.executor.extraClassPath", and "spark.executor.extraLibraryPath" to ./data/libs/*
Reference: https://console.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic3.html#spark-submit_properties

How to lauch prorams in Apache spark?

I have a “myprogram.py” and my “myprogram.scala” that I need to run on my spark machine. How Can I upload and launch them?
I have been using shell to do my transformation and calling actions, but now I want to launch a complete program on spark machine instead of entering single commands every time. Also I believe that will make it easy for me to make changes to my program instead of starting to enter commands in shell.
I did standalone installation in Ubuntu 14.04, on single machine, not a cluster, used spark 1.4.1.
I went through spark docs online, but I only find instruction on how to do that on cluster. Please help me on that.
Thank you.
The documentation to do this (as commented above) is available here: http://spark.apache.org/docs/latest/submitting-applications.html
However, the code you need is here:
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
You'll need to compile the scala file using sbt (documentation here: http://www.scala-sbt.org/0.13/tutorial/index.html)
Here's some information on the build.sbt file you'll need in order to grab the right dependencies: http://spark.apache.org/docs/latest/quick-start.html
Once the scala file is compiled, you'll send the resulting jar using the above submit command.
Put it simply:
In Linux terminal, cd to the directory that spark is unpacked/installed
Note, this folder normally contains subfolders like “bin”, “conf”, “lib”, “logs” and so on.
To run the Python program locally with simple/default settings, type command
./bin/spark-submit --master local[*] myprogram.py
More complete descriptions are here like zero323 and ApolloFortyNine described.

Automatically including jars to PySpark classpath

I'm trying to automatically include jars to my PySpark classpath. Right now I can type the following command and it works:
$ pyspark --jars /path/to/my.jar
I'd like to have that jar included by default so that I can only type pyspark and also use it in IPython Notebook.
I've read that I can include the argument by setting PYSPARK_SUBMIT_ARGS in env:
export PYSPARK_SUBMIT_ARGS="--jars /path/to/my.jar"
Unfortunately the above doesn't work. I get the runtime error Failed to load class for data source.
Running Spark 1.3.1.
Edit
My workaround when using IPython Notebook is the following:
$ IPYTHON_OPTS="notebook" pyspark --jars /path/to/my.jar
You can add the jar files in the spark-defaults.conf file (located in the conf folder of your spark installation). If there is more than one entry in the jars list, use : as separator.
spark.driver.extraClassPath /path/to/my.jar
This property is documented in https://spark.apache.org/docs/1.3.1/configuration.html#runtime-environment
As far as I know, you have to import jars to both driver AND executor. So, you need to edit conf/spark-defaults.conf adding both lines below.
spark.driver.extraClassPath /path/to/my.jar
spark.executor.extraClassPath /path/to/my.jar
When I went through this, I did not need any other parameters. I guess you will not need them too.
Recommended way since Spark 2.0+ is to use
spark.driver.extraLibraryPath
and spark.executor.extraLibraryPath
https://spark.apache.org/docs/2.4.3/configuration.html#runtime-environment
ps. spark.driver.extraClassPath and spark.executor.extraClassPath are still there,
but deprecated and will be removed in a future release of Spark.

Resources