spark-submit with external python packages - apache-spark

I am using the python textblob package to calculate the sentiment in a spark job. In order to make the textblob package available on the spark cluster, I created a virtual environment, installed the textblob package in there and packed it using venv-pack. The resulting archive is then used as a parameter for the --archives flag of spark-submit. Eventually, I set the environment variables according to the python package management section of the spark documentation (https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv):
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py
When I execute the spark-submit command, the application tells me that it could not find the textblob package.
I have been trying to make this setup work for days. I have already tried conda, pex and the --py-files option, but none of them worked. Am I missing something?
Eventually, I ended up installing textblob directly on the spark-master machine which seems to work, but I'd rather stick to the documented way.
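For reference, the full workflow described in the question can be sketched as follows. This is a minimal command sequence, not a fix; the package and archive names match the question, and it assumes a YARN or Kubernetes cluster (with --archives, the archive is only unpacked on such resource managers):

```shell
# Build the environment and pack it.
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install textblob venv-pack
venv-pack -o pyspark_venv.tar.gz

# Point the executors at the python inside the unpacked archive.
# PYSPARK_DRIVER_PYTHON must NOT be set in cluster mode, because there
# the driver also runs from the unpacked archive.
export PYSPARK_DRIVER_PYTHON=python   # client mode only
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py
```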

Related

Spark on kubernetes with zeppelin

I am following this guide to spin up a zeppelin container in a local kubernetes cluster set up using minikube.
https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html
I am able to set up zeppelin and run some sample code there. I have downloaded spark 2.4.5 & 2.4.0 source code and built it for kubernetes support with the following command:
./build/mvn -Pkubernetes -DskipTests clean package
Once spark is built I created a docker container as explained in the article:
bin/docker-image-tool.sh -m -t 2.4.X build
I configured zeppelin to use the spark image which was built with kubernetes support. The article above explains that the spark interpreter will auto configure spark on kubernetes to run in client mode and run the job.
But whenever I try to run any paragraph with spark, I receive the following error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
I tried setting the spark configuration spark.jars.ivy in zeppelin to point to a temp directory but that does not work either.
I found a similar issue here:
basedir must be absolute: ?/.ivy2/local
But I can't seem to configure spark to run with the spark.jars.ivy=/tmp/.ivy config. I also tried setting it in spark-defaults.conf when building spark, but that does not seem to work either.
I'm quite stumped by this problem; any guidance on how to solve it would be appreciated.
Thanks!
I have also run into this problem, but a workaround I used for setting spark.jars.ivy=/tmp/.ivy is to set it as an environment variable instead.
In your spark interpreter settings, add the following property: SPARK_SUBMIT_OPTIONS and set its value to --conf spark.jars.ivy=/tmp/.ivy.
This should pass additional options to spark submit and your job should continue.
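Concretely, the setting described above looks like this. SPARK_SUBMIT_OPTIONS is a variable Zeppelin passes through to spark-submit; entering it as an interpreter property is equivalent to exporting it in Zeppelin's conf/zeppelin-env.sh:

```shell
# In the Zeppelin spark interpreter settings, add the property:
#   name:  SPARK_SUBMIT_OPTIONS
#   value: --conf spark.jars.ivy=/tmp/.ivy
# Equivalent environment-variable form in conf/zeppelin-env.sh:
export SPARK_SUBMIT_OPTIONS="--conf spark.jars.ivy=/tmp/.ivy"
```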

No start-history-server.sh when pyspark installed through conda

I have installed pyspark in a miniconda environment on Ubuntu through conda install pyspark. So far everything works fine: I can run jobs through spark-submit and I can inspect running jobs at localhost:4040. But I can't locate start-history-server.sh, which I need to look at jobs that have completed.
It is supposed to be in {spark}/sbin, where {spark} is the installation directory of spark. I'm not sure where that is supposed to be when spark is installed through conda, but I have searched through the entire miniconda directory and I can't seem to locate start-history-server.sh. For what it's worth, this is for both python 3.7 and 2.7 environments.
My question is: is start-history-server.sh included in a conda installation of pyspark?
If yes, where? If no, what's the recommended alternative way of evaluating spark jobs after the fact?
EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.
As @pedvaljim points out in a comment, this is not conda-specific; the sbin directory isn't included in pyspark at all.
The good news is that you can manually copy this folder from GitHub into your spark folder (I'm not sure how to download just one directory, so I cloned all of spark). If you're using mini- or anaconda, the spark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
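Once the sbin scripts are in place, the remaining steps can be sketched like this. The environment name and log directory are placeholders; jobs must write event logs, since the history server only renders what it finds in spark.eventLog.dir (the default location it reads is file:/tmp/spark-events):

```shell
# Placeholder path: adjust the env name and python version to yours.
SPARK_HOME=~/miniconda3/envs/myenv/lib/python3.7/site-packages/pyspark
mkdir -p /tmp/spark-events

# Run jobs with event logging enabled so there is something to show:
spark-submit --conf spark.eventLog.enabled=true \
             --conf spark.eventLog.dir=/tmp/spark-events app.py

# Then browse completed jobs at http://localhost:18080
"$SPARK_HOME"/sbin/start-history-server.sh
```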

Properly configuring PySpark and Anaconda3 on Linux

Here are the steps I have taken so far:
I installed Anaconda3, with everything included, in the directory $HOME/anaconda3/bin.
I cd'ed into $HOME/anaconda3/bin and ran the command ./conda install -c conda-forge pyspark. It was successful.
I didn't do anything else. More specifically, there are no variables set in my .bashrc
Here are some important details:
I am on a distributed cluster running Hadoop, so there might be other directories outside of my home folder that I have yet to discover but I might need. I also don't have admin access.
Jupyter notebook runs just fine.
Here is my goal:
To add environment variables or configure some files so that I can run pyspark in a Jupyter Notebook.
What other steps do I need to take after Step 3 to achieve this goal?
Since you have installed pyspark with conda, and as you say Jupyter notebook runs fine (presumably for the same Anaconda distribution), there are no further steps required - you should be able to open a new notebook and import pyspark.
Notice though that installing pyspark that way (i.e. with pip or conda) gives only limited functionality; from the package docs:
The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Installing pyspark with pip or conda is a relatively recent add-on, aimed at the cases described in the docs above. I don't know what limitations you may face (have never tried it) but if you need the full functionality, you should download the full Spark distribution (of which pyspark is an integral part).

Setting python in workers in SPARK YARN with anaconda

I went through this post on setting the python path for workers/drivers in standalone spark mode. Apparently, the straightforward way is to set the PYSPARK_PYTHON environment variable in the ./conf/spark-env.sh file located in the conf folder of spark, such as /opt/cloudera/parcels/CDH/lib/spark/conf/ in my case. However, I was trying to figure out how to do the same for spark in YARN cluster mode. After playing around for quite some time, I found this cloudera blog about adding the Anaconda package.
Now all that is left to do is add the Anaconda path in the spark-env.sh file instead of the standard python path. It finally worked. Please share if there is a better/alternative way to set up/update python for SPARK and pyspark.
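The change described above can be sketched as follows. The spark-env.sh path is the one from the post; the Anaconda location is an example and may differ on your cluster. For YARN cluster mode, the same python can alternatively be set per job via spark.yarn.appMasterEnv:

```shell
# In /opt/cloudera/parcels/CDH/lib/spark/conf/spark-env.sh:
export PYSPARK_PYTHON=/opt/anaconda3/bin/python   # example Anaconda path

# Per-job alternative for YARN cluster mode:
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/anaconda3/bin/python \
  app.py
```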

How to add spark-csv package to jupyter server on Azure for use with iPython

I want to use the spark-csv package from https://github.com/databricks/spark-csv from within the jupyter service running on Spark HDInsight cluster on Azure.
From local cluster I know I can do this like:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
However, I don't understand where to put this in the Azure spark configuration. Any clues/hints are appreciated.
You can use the %%configure magic to add any required external package.
It should be as simple as putting the following snippet in your first code cell.
%%configure
{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }
This specific example is also covered in the documentation. Just make sure you start Spark session after the %%configure cell.
One option for managing Spark packages in a cluster from a Jupyter notebook is Apache Toree. Toree gives you some extra line magics that allow you to manage Spark packages from within a Jupyter notebook. For example, inside a Jupyter scala notebook, you would install spark-csv with
%AddDeps com.databricks spark-csv_2.11 1.4.0 --transitive
To install Apache Toree on your Spark clusters, ssh into your Spark clusters and run:
sudo pip install --pre toree
sudo jupyter toree install \
--spark_home=$SPARK_HOME \
--interpreters=PySpark,SQL,Scala,SparkR
I know you specifically asked about Jupyter notebooks running PySpark. At this time, Apache Toree is an incubating project. I have run into trouble using the provided line magics with pyspark notebooks specifically. Maybe you will have better luck. I am looking into why this is, but personally, I prefer Scala in Spark. Hope this helps!
You can try to execute your two lines of code (export ...) in a script that you can invoke in Azure at the time of creation of the HDInsight cluster.
Since you are using HDInsight, you can use a "Script Action" on the Spark cluster load that imports the needed libraries. The script can be a very simple shell script and it can be automatically executed on startup, and automatically re-executed on new nodes if the cluster is resized.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
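A Script Action of the kind described above could look like the sketch below. The spark-env.sh path is an assumption for an HDInsight Linux cluster, and the exports are the ones from the question; the script runs on every node at startup and again on nodes added when the cluster is resized:

```shell
#!/usr/bin/env bash
# Illustrative Script Action: persist the submit args from the question
# so every node picks them up. Adjust SPARK_ENV to your cluster's layout.
SPARK_ENV=/etc/spark/conf/spark-env.sh
echo 'export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"' >> "$SPARK_ENV"
echo 'export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"' >> "$SPARK_ENV"
```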
