Set python path for Spark worker - apache-spark

What's the "correct" way to set the sys path for Python worker node?
Is it a good idea for worker nodes to "inherit" sys path from master?
Is it a good idea to set the path on the worker nodes through .bashrc? Or is there some standard Spark way of setting it?

A standard way of setting environment variables, including PYSPARK_PYTHON, is to use the conf/spark-env.sh file. Spark ships with a template file (conf/spark-env.sh.template) which explains the most common options.
It is a normal bash script, so you can use it the same way you would use .bashrc.
You'll find more details in the Spark Configuration Guide.

With the following command you can change the Python path for the current job only, which also allows different Python paths for the driver and the executors:
PYSPARK_DRIVER_PYTHON=/home/user1/anaconda2/bin/python PYSPARK_PYTHON=/usr/local/anaconda2/bin/python pyspark --master ..

You may do either of the following:
In config:
Update SPARK_HOME/conf/spark-env.sh and add the lines below:
# for pyspark
export PYSPARK_PYTHON="path/to/python"
# for driver, defaults to PYSPARK_PYTHON
export PYSPARK_DRIVER_PYTHON="path/to/python"
OR
In the code, add:
import os
# Set spark environments
os.environ['PYSPARK_PYTHON'] = 'path/to/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'path/to/python'
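A minimal sketch of how this looks in a full script, assuming local or client mode; both variables must be set before the SparkSession (and its SparkContext) is created, and the interpreter paths are placeholders:
import os
import sys
from pyspark.sql import SparkSession

# Must be set before the SparkSession/SparkContext is created;
# the paths are placeholders for whatever interpreter you want to use.
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'          # interpreter for executors
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'   # interpreter for the driver

spark = SparkSession.builder.master("local[2]").appName("env-example").getOrCreate()

# Quick check: ask a worker task which interpreter it is running under.
print(spark.sparkContext.parallelize([0], 1).map(lambda _: sys.executable).first())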

The error in my case was:
Exception: Python in worker has different version 2.6 than that in
driver 2.7, PySpark cannot run with different minor versions
The solution that helped:
export PYSPARK_PYTHON=python2.7
export PYSPARK_DRIVER_PYTHON=python2.7
jupyter notebook
Of course, I installed Python 2.7 locally on the workers.
I suppose it is also important that I set the PATH.
I did not rely on the workers' local settings; the path was inherited from the settings on the edge node where jupyter-notebook runs.
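A quick way to confirm that the driver and the executors now agree on the interpreter version is to compare them with a small job; a sketch (the four-partition job is arbitrary):
import platform
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

driver_version = platform.python_version()
# Each task reports the interpreter version of the worker it runs on.
worker_versions = (spark.sparkContext
                   .parallelize(range(4), 4)
                   .map(lambda _: platform.python_version())
                   .distinct()
                   .collect())
print("driver:", driver_version, "workers:", worker_versions)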

Related

pyspark - can PYTHONPATH be used for the Python interpreter on worker nodes to find Python modules?

Please advise where I should look to understand the detailed mechanism of how PySpark finds Python modules on the worker nodes, especially the usage of PYTHONPATH.
PYTHONPATH variable
The Environment Variables section of the configuration docs says environment variables defined in spark-env.sh can be used:
Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed.
Then, if I define PYTHONPATH in spark-env.sh on all the worker nodes, will PySpark start the Python interpreter process with that PYTHONPATH passed to the UNIX process, so that Python modules are loaded from the PYTHONPATH?
Precedence of PYTHONPATH when --archives is used
In case PYTHONPATH in spark-env.sh can be used, what happens when --archives specifies a virtual-environment package?
The Python Package Management guide says a conda environment can be packaged into a tar.gz and shipped to the worker nodes:
There are multiple ways to manage Python dependencies in the cluster:
Using PySpark Native Features
Using Conda
Using Virtualenv
Using PEX
Using Conda
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
After that, you can ship it together with scripts or in the code by using the --archives option or spark.archives configuration (spark.yarn.dist.archives in YARN). It automatically unpacks the archive on executors.
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
However, when PYTHONPATH is defined in spark-env.sh, will it still be used, and which takes precedence: the packaged conda environment or PYTHONPATH?
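One way to inspect which entries actually reach the workers' module search path, whichever source wins, is to print sys.path from inside a task; a small diagnostic sketch:
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print the module search path as one executor actually sees it; entries coming
# from PYTHONPATH and from a shipped --archives environment both show up here.
worker_path = spark.sparkContext.parallelize([0], 1).map(lambda _: sys.path).first()
for entry in worker_path:
    print(entry)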

Setting python in workers in SPARK YARN with anaconda

I went through this post on setting the Python path for workers/drivers in standalone Spark mode. Apparently, the straightforward way is to set the PYSPARK_PYTHON environment variable in the ./conf/spark-env.sh file located in Spark's conf folder, such as /opt/cloudera/parcels/CDH/lib/spark/conf/ in my case. However, I was trying to figure out how to repeat this for Spark in YARN cluster mode. After playing around for quite some time, I found this Cloudera blog post on adding the Anaconda package.
Now all that is left to do is add the Anaconda path in the spark-env.sh file instead of the standard Python path. It finally worked. Please share if there is a better or alternative way to set up or update Python for Spark and PySpark.
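For YARN, the interpreter can also be pointed at Anaconda per application through Spark configuration rather than spark-env.sh; a sketch assuming client mode and a placeholder Anaconda path (in cluster mode the same properties would normally be passed to spark-submit via --conf):
from pyspark.sql import SparkSession

# Placeholder path: adjust to wherever the Anaconda parcel/environment lives on the nodes.
anaconda_python = "/opt/cloudera/parcels/Anaconda/bin/python"

spark = (SparkSession.builder
         .appName("yarn-anaconda-example")
         .master("yarn")
         .config("spark.executorEnv.PYSPARK_PYTHON", anaconda_python)        # executors
         .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", anaconda_python)  # application master
         .getOrCreate())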

How to specify memory and cpu's for a Jupyter spark/pyspark notebook from command line?

The intention is to achieve something along the lines of
jupyter-notebook --kernel-options="--mem 1024m --cpus 4"
Where kernel-options would be forwarded to the pyspark or spark kernels.
We need this in order to run separate Jupyter servers, one for the pyspark kernel and one for the Spark (Scala) kernel, on the same machine. This is required because a single Jupyter server does not support running pyspark and (Scala) Spark kernels concurrently.
For Jupyter 4.0 and later you should be able to start a Spark-enabled notebook like this:
pyspark [options]
where [options] is the list of any flags you pass to pyspark.
For this to work, you need to set the following environment variables in your .profile:
export PYSPARK_DRIVER_PYTHON="/path/to/my/bin/jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON="/path/to/my/bin/python"
Alternatively, if you are using Apache Toree, you could pass them via SPARK_OPTS:
SPARK_OPTS='--master=local[4]' jupyter notebook
More details on Apache Toree setup.
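If the notebook instead runs a plain Python kernel (for example via findspark), memory and core settings can also be fixed from inside the notebook when the session is created; a sketch with purely illustrative values:
from pyspark.sql import SparkSession

# Resource settings must be supplied before the SparkContext/JVM is created;
# the values below are only illustrative.
spark = (SparkSession.builder
         .master("local[4]")                      # number of cores to use
         .config("spark.driver.memory", "1024m")  # driver heap size
         .appName("sized-notebook-session")
         .getOrCreate())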

The SPARK_HOME env variable is set but Jupyter Notebook doesn't see it. (Windows)

I'm on Windows 10. I was trying to get Spark up and running in a Jupyter Notebook alongside Python 3.5. I installed a pre-built version of Spark and set the SPARK_HOME environment variable. I installed findspark and ran the code:
import findspark
findspark.init()
I receive a ValueError:
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
However, the SPARK_HOME variable is set. Here is a screenshot showing the list of environment variables on my system.
Has anyone encountered this issue or would know how to fix this? I only found an old discussion in which someone had set SPARK_HOME to the wrong folder but I don't think it's my case.
I had the same problem and wasted a lot of time. I found two solutions:
Copy the downloaded Spark folder somewhere into the C: drive and pass that path as below:
import findspark
findspark.init('C:/spark')
Use findspark's own function to locate the Spark folder automatically:
import findspark
findspark.find()
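A combined sketch of the two approaches above; "C:/spark" is just a placeholder for wherever the unpacked Spark folder lives:
import os
import findspark

# If SPARK_HOME isn't visible to the notebook process, set it here first;
# "C:/spark" is a placeholder for the unpacked Spark folder.
os.environ.setdefault("SPARK_HOME", "C:/spark")

findspark.init()
print(findspark.find())  # shows which Spark installation was picked up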
The environment variables are picked up only after a system reboot; it works after restarting your system.
I had the same problem and solved it by installing vagrant and VirtualBox. (Note: I use Mac OS and Python 2.7.11.)
Take a look at this tutorial, which is for the Harvard CS109 course:
https://github.com/cs109/2015lab8/blob/master/installing_vagrant.pdf
After running "vagrant reload" in the terminal, I am able to run my code without errors.
NOTE the difference between the results of the os.getcwd() command shown in the attached images.
I had the same problem when installing Spark using pip install pyspark findspark in a conda environment.
The solution was to do this:
export SPARK_HOME=/Users/pete/miniconda3/envs/cenv3/lib/python3.6/site-packages/pyspark/
jupyter notebook
You'll have to substitute the name of your conda environment for cenv3 in the command above.
Restarting the system after setting up the environment variables worked for me.
I had the same problem and solved it by closing cmd and then opening it again. I forgot that after editing an environment variable on Windows you have to restart cmd.
I got the same error. Initially, I had stored my Spark folder in the Documents directory. Later, when I moved it to the Desktop, it suddenly started recognizing all the system variables and it ran findspark.init() without any error.
Try it out once.
This error may occur if you don't set the environment variables in your .bashrc file. Set your Python environment variables as follows:
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
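With those variables in place, pyspark should be importable directly, without findspark; a quick sanity check:
# If PYTHONPATH includes SPARK_HOME/python and the py4j zip, this import works
# in a fresh shell without findspark.
import pyspark
print(pyspark.__version__)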
The simplest way I found to use Spark with Jupyter Notebook is:
1- Download Spark
2- Unzip it to the desired location
3- Open Jupyter Notebook in the usual way, nothing special
4- Now run the code below:
import findspark
findspark.init("location of spark folder")
# in my case it is:
import findspark
findspark.init("C:\\Users\\raj24\\OneDrive\\Desktop\\spark-3.0.1-bin-hadoop2.7")

What path do I use for pyspark?

I have Spark installed, and I can go into the bin folder within my Spark version, run ./spark-shell, and it runs correctly.
But, for some reason, I am unable to launch pyspark or any of its submodules.
So, I go into bin and launch ./pyspark and it tells me that my path is incorrect.
The current path I have for PYSPARK_PYTHON is the same as where I'm running the pyspark executable script from.
What is the correct path for PYSPARK_PYTHON? Shouldn't it be the path that leads to the executable script called pyspark in the bin folder of the spark version?
That's the path that I have now, but it tells me env: <full PYSPARK_PYTHON path> no such file or directory. Thanks.
What is the correct path for PYSPARK_PYTHON? Shouldn't it be the path that leads to the executable script called pyspark in the bin folder of the spark version?
No, it shouldn't. It should point to the Python executable you want to use with Spark (for example, the output of which python). If you don't want to use a custom interpreter, just ignore it; Spark will use the first Python interpreter available on your system PATH.
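A common, safe choice is to point PYSPARK_PYTHON at the same interpreter that runs the driver script; a minimal sketch of that idea:
import os
import sys

# Point the workers at the interpreter running this script
# (equivalent to using the output of `which python`).
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ.setdefault("PYSPARK_DRIVER_PYTHON", sys.executable)

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("python-path-check").getOrCreate()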
