Setting Python for workers in Spark on YARN with Anaconda - apache-spark

I went through this post on setting the Python path for workers/drivers in standalone Spark mode. Apparently, the straightforward way is to set the PYSPARK_PYTHON environment variable in the ./conf/spark-env.sh file located in Spark's conf folder, which in my case is /opt/cloudera/parcels/CDH/lib/spark/conf/. However, I could not figure out how to do the same for Spark in YARN cluster mode. After playing around for quite some time, I found this Cloudera blog post about adding Anaconda.
Now all that is left to do is point spark-env.sh at the Anaconda Python instead of the standard system Python. It finally worked. Please share if there is a better/alternative way to set up or update Python for Spark and PySpark.
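For reference, a minimal sketch of what the spark-env.sh change can look like; the Anaconda parcel path below is the usual Cloudera default and is an assumption, so adjust it to wherever Anaconda actually lives on your cluster:
# in /opt/cloudera/parcels/CDH/lib/spark/conf/spark-env.sh
# (the Anaconda parcel path is the common Cloudera default; it may differ on your cluster)
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python   # optional, for the driver side
Because Cloudera Manager distributes the parcel to every node, the same interpreter path resolves on the YARN executors as well.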

Related

Where is the local Hadoop folder in pyspark (Mac)?

I have installed pyspark on my local Mac using Homebrew. I can see Spark under /usr/local/Cellar/apache-spark/3.2.1/,
but I cannot find a Hadoop folder. If I run pyspark in the terminal, it launches the Spark shell.
Where can I see the Hadoop path?
I am trying to connect S3 to pyspark and I have dependency JARs.
You do not need to know the location of Hadoop to do this.
You should use a command like spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 app.py instead, which pulls all necessary dependencies for you rather than requiring you to download all the JARs (and their dependencies) locally.
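For what it's worth, a slightly fuller sketch of that submit when connecting to S3; the credential settings are placeholders and optional (an instance profile or another credentials provider works too):
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.1 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  app.py
Your code can then read paths of the form s3a://bucket/key without knowing where Hadoop itself is installed.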

spark-submit with external python packages

I am using the Python textblob package to calculate sentiment in a Spark job. To make the textblob package available on the Spark cluster, I created a virtual environment, installed the textblob package in it, and packed it using venv-pack. The resulting archive is then passed to the --archives flag of spark-submit. Finally, I set the environment variables according to the Python package management section of the Spark documentation (https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv):
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py
When I execute the spark-submit command, the application tells me that it could not find the textblob package.
I have been trying to make this setup work for days. I already tried conda, pex, and the --py-files option, but none of them worked. Am I missing something?
Eventually, I ended up installing textblob directly on the spark-master machine which seems to work, but I'd rather stick to the documented way.
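For reference, the documented virtualenv workflow end to end looks roughly like the sketch below, based on the Spark packaging guide linked above and assuming a YARN cluster; file and package names follow the question, and the venv should be built with the same Python version the cluster nodes run:
# build and pack the environment
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install textblob venv-pack
venv-pack -o pyspark_venv.tar.gz

# submit against YARN; the archive is unpacked into ./environment inside the YARN containers
export PYSPARK_PYTHON=./environment/bin/python   # PYSPARK_DRIVER_PYTHON stays unset in cluster mode
spark-submit --master yarn --deploy-mode cluster \
  --archives pyspark_venv.tar.gz#environment \
  app.py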

No start-history-server.sh when pyspark installed through conda

I have installed pyspark in a Miniconda environment on Ubuntu through conda install pyspark. So far everything works fine: I can run jobs through spark-submit and inspect running jobs at localhost:4040. But I can't locate start-history-server.sh, which I need in order to look at jobs that have completed.
It is supposed to be in {spark}/sbin, where {spark} is the installation directory of Spark. I'm not sure where that is when Spark is installed through conda, but I have searched the entire Miniconda directory and can't locate start-history-server.sh. For what it's worth, this is the case for both Python 3.7 and 2.7 environments.
My question is: is start-history-server.sh included in a conda installation of pyspark?
If yes, where? If no, what's the recommended alternative way of evaluating spark jobs after the fact?
EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.
As @pedvaljim points out in a comment, this is not conda-specific; the sbin directory isn't included in pyspark at all.
The good news is that you can simply download this folder manually from GitHub into your Spark folder (I wasn't sure how to download just one directory, so I cloned all of Spark). If you're using Miniconda or Anaconda, the Spark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
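If you'd rather not clone the whole repository, a sparse checkout is one option; the sketch below is an assumption-laden example (paths and the event-log directory are placeholders, and the sparse clone needs a reasonably recent git) that also shows the settings the history server needs before it has anything to display:
# grab only sbin/ from the Spark repo
git clone --depth 1 --filter=blob:none --sparse https://github.com/apache/spark.git
cd spark && git sparse-checkout set sbin
cp -r sbin ~/miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark/

# jobs must write event logs, otherwise the history server has nothing to show
mkdir -p /tmp/spark-events
spark-submit --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/spark-events app.py

# point the history server at the same directory, then browse localhost:18080
export SPARK_HOME=~/miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/spark-events" $SPARK_HOME/sbin/start-history-server.sh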

Properly configuring PySpark and Anaconda3 on Linux

Here are the steps I have taken so far:
I installed Anaconda3, with everything included, in the directory $HOME/anaconda3/bin.
I cd'ed into $HOME/anaconda3/bin and ran the command ./conda install -c conda-forge pyspark. It was successful.
I didn't do anything else. More specifically, there are no variables set in my .bashrc.
Here are some important details:
I am on a distributed cluster running Hadoop, so there might be other directories outside my home folder that I have yet to discover but might need. I also don't have admin access.
Jupyter notebook runs just fine.
Here is my goal:
To add variables or configure some files so that I can run pyspark in a Jupyter Notebook.
What other steps do I need to take after Step 3 in order to achieve this goal?
Since you have installed pyspark with conda, and as you say Jupyter notebook runs fine (presumably for the same Anaconda distribution), there are no further steps required: you should be able to open a new notebook and import pyspark.
Notice though that installing pyspark that way (i.e. with pip or conda) gives only limited functionality; from the package docs:
The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Installing pyspark with pip or conda is a relatively recent addition, aimed at the cases described in the docs quoted above. I don't know what limitations you may face (I have never tried it), but if you need the full functionality, you should download the full Spark distribution (of which pyspark is an integral part).
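If you do reach that point, the full distribution is just a tarball plus a couple of environment variables, with no admin access required; the file name below is a placeholder for whichever release you pick from the downloads page:
# file name is a placeholder; use the release you downloaded from the Apache Spark downloads page
tar xzf spark-3.x.y-bin-hadoop3.tgz -C $HOME
export SPARK_HOME=$HOME/spark-3.x.y-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=$HOME/anaconda3/bin/python   # reuse the Anaconda interpreter from Step 1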

How to add the spark-csv package to a Jupyter server on Azure for use with IPython

I want to use the spark-csv package from https://github.com/databricks/spark-csv from within the Jupyter service running on a Spark HDInsight cluster on Azure.
On a local cluster, I know I can do this like:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
However, I don't know where to put this in the Azure Spark configuration. Any clues or hints are appreciated.
You can use the %%configure magic to add any required external package.
It should be as simple as putting the following snippet in your first code cell.
%%configure
{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }
This specific example is also covered in the documentation. Just make sure you start the Spark session after the %%configure cell.
One option for managing Spark packages in a cluster from a Jupyter notebook is Apache Toree. Toree gives you some extra line magics that allow you to manage Spark packages from within a Jupyter notebook. For example, inside a Jupyter scala notebook, you would install spark-csv with
%AddDeps com.databricks spark-csv_2.11 1.4.0 --transitive
To install Apache Toree on your Spark cluster, ssh into the cluster and run:
sudo pip install --pre toree
sudo jupyter toree install \
--spark_home=$SPARK_HOME \
--interpreters=PySpark,SQL,Scala,SparkR
I know you specifically asked about Jupyter notebooks running PySpark. At this time, Apache Toree is an incubating project, and I have run into trouble using the provided line magics with pyspark notebooks specifically. Maybe you will have better luck. I am looking into why this is, but personally I prefer Scala in Spark. Hope this helps!
You can try executing your two lines of code (the export ... lines) in a script that you invoke in Azure at the time the HDInsight cluster is created.
Since you are using HDInsight, you can use a "Script Action" that runs on cluster load and imports the needed libraries. The script can be a very simple shell script; it is executed automatically on startup and re-executed on new nodes if the cluster is resized (a sketch follows below the link).
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
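Such a script action can be as small as appending those two exports to the Spark environment file on each node; the sketch below assumes the config lives under /etc/spark/conf on HDInsight Linux nodes, which may vary by cluster version:
#!/bin/bash
# hypothetical script action: make spark-csv available to pyspark sessions
# (the config path is an assumption about the HDInsight Linux layout)
cat >> /etc/spark/conf/spark-env.sh <<'EOF'
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
EOF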
