Running Python app on Spark with Conda dependencies - apache-spark

I am trying to run a Python script on Spark. I am running Spark in client mode (i.e. a single node) with a Python script that has some dependencies (e.g. pandas) installed via Conda. There are various resources that cover this use case, for example:
https://conda.github.io/conda-pack/spark.html
https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
Using those as examples, I run Spark via the following command from the Spark bin directory, where /tmp/env.tar is the Conda environment packed by conda-pack:
export PYSPARK_PYTHON=./environment/bin/python
./spark-submit --archives=/tmp/env.tar#environment script.py
Spark throws the following exception:
java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
Why does this not work? I am also curious about the ./ in the Python path, as it's not clear where Spark unpacks the tar file. I assumed I did not need to load the tar file into HDFS since this is all running on a single node (but perhaps I do for cluster mode?).
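A hedged sketch of what usually resolves this: the relative path ./environment/bin/python only exists inside executor containers, where a cluster manager such as YARN unpacks the --archives tarball into each container's working directory. The client-mode driver runs in your current directory, where no such path exists, so it needs an absolute path to an environment that is already unpacked locally. Something along these lines, assuming /tmp/env.tar came from conda-pack:

```shell
# Illustrative sketch, assuming /tmp/env.tar was produced by conda-pack.
# Driver: unpack the environment locally and point at it with an absolute path.
mkdir -p /tmp/env && tar -xf /tmp/env.tar -C /tmp/env
export PYSPARK_DRIVER_PYTHON=/tmp/env/bin/python
# Executors: the relative path is valid there, because the cluster manager
# unpacks the archive into their working directory under ./environment.
export PYSPARK_PYTHON=./environment/bin/python
./spark-submit --master yarn --archives /tmp/env.tar#environment script.py
```

Note that without a cluster manager (plain local mode), whether the archive gets unpacked at all depends on your Spark version, which is another reason the relative path can fail.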

Related

where is local hadoop folder in pyspark (mac)

I have installed pyspark on my local Mac using Homebrew. I can see Spark under /usr/local/Cellar/apache-spark/3.2.1/, but I am not able to see a hadoop folder. If I run pyspark in the terminal, it starts the Spark shell.
Where can I see its path?
I am trying to connect S3 to pyspark and I have dependency JARs.
You do not need to know the location of Hadoop to do this.
You should use a command like spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 app.py instead, which will pull in all necessary dependencies automatically rather than requiring you to download each JAR (and its dependencies) locally.
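For example, a sketch of a fuller invocation for reading s3a:// paths (the hadoop-aws version must match the Hadoop version your Spark build ships with, and app.py stands in for your script; the fs.s3a.* keys are the standard S3A connector properties):

```shell
# Sketch: --packages fetches hadoop-aws and its transitive dependencies from
# Maven Central, so no local Hadoop folder or manual JAR download is needed.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.1 \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  app.py
```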

No start-history-server.sh when pyspark installed through conda

I have installed pyspark in a miniconda environment on Ubuntu through conda install pyspark. So far everything works fine: I can run jobs through spark-submit and I can inspect running jobs at localhost:4040. But I can't locate start-history-server.sh, which I need to look at jobs that have completed.
It is supposed to be in {spark}/sbin, where {spark} is the installation directory of spark. I'm not sure where that is supposed to be when spark is installed through conda, but I have searched through the entire miniconda directory and I can't seem to locate start-history-server.sh. For what it's worth, this is for both python 3.7 and 2.7 environments.
My question is: is start-history-server.sh included in a conda installation of pyspark?
If yes, where? If no, what's the recommended alternative way of evaluating spark jobs after the fact?
EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.
As @pedvaljim points out in a comment, this is not conda-specific; the sbin directory isn't included in pyspark at all.
The good news is that it's possible to manually copy this folder from GitHub into your Spark folder (I'm not sure how to download just one directory, so I simply cloned all of Spark). If you're using mini- or anaconda, the Spark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
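As an aside, rather than searching the whole miniconda directory, you can ask Python where the pyspark package lives (a sketch; it assumes pyspark is importable in the currently active environment):

```shell
# Print the pyspark installation directory of the active environment;
# this is the folder the copied sbin/ scripts would go into.
python -c "import pyspark, os; print(os.path.dirname(pyspark.__file__))"
```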

Spark Standalone and Virtual Environments

With a spark cluster configured as spark-standalone, we are trying to configure spark-submit jobs to utilize virtual environments managed by pipenv.
The project has this structure:
project/
|-- .venv/
|   |-- bin/python
|   |-- lib/python3.6/site-packages
|-- src/
|   |-- app.py
The current attempt involves zipping the virtual environment (zip -r site.zip .venv) to include the python executable and all site packages, and ship that along to the executors.
The spark-submit command is currently:
PYSPARK_DRIVER_PYTHON=./.venv/bin/python \
spark-submit --py-files site.zip src/app.py
The thinking is that the --py-files argument should be unzipping the site.zip into the working directory on the executors, and .venv should be reproduced with the .venv/bin/python and site-packages available on the python path. This is clearly not the case as we are receiving the error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task
0.3 in stage 0.0 (TID 3, [executor-node-uri], executor 0):
java.io.IOException: Cannot run program "./.venv/bin/python":
error=2, No such file or directory
My question is: is our understanding of --py-files correct? I tried browsing the Spark source code, but could not follow the flow of the --py-files argument in the case that it is a zip file. There are a number of tutorials for YARN mode and shipping conda environments with spark-submit, but not much on Spark standalone. Is this even possible?
Addendum: These are the YARN tutorials I was learning from:
https://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/
http://quasiben.github.io/blog/2016/4/15/conda-spark/
The --py-files option will not unpack a zip file you provide it. The reason Python can use packages inside a zip file is that Python supports importing from zip archives directly. However, if the Python binary itself is packaged that way, Spark will not be able to locate it.
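To make the distinction concrete, here is a minimal, self-contained illustration (the module name mymod and the file names are made up): the interpreter can import a pure-Python module straight out of a zip via zipimport, which is why --py-files works for library code, but the operating system cannot exec a Python binary that is still inside an archive.

```shell
# Create a tiny module and pack it the way --py-files would ship it.
mkdir -p pkg_src
echo 'VALUE = 42' > pkg_src/mymod.py
python3 - <<'EOF'
import sys
import zipfile

# Build the zip containing only the module.
with zipfile.ZipFile("site.zip", "w") as z:
    z.write("pkg_src/mymod.py", arcname="mymod.py")

sys.path.insert(0, "site.zip")   # zip archives can sit directly on sys.path
import mymod                     # resolved by zipimport, not the filesystem

print(mymod.VALUE, mymod.__file__)
EOF
```

Printing mymod.__file__ shows the module was loaded from inside site.zip; by contrast, nothing equivalent exists for executing .venv/bin/python while it is still zipped, which is exactly the "No such file or directory" failure above.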
To achieve this instead you should use the (terribly documented) --archives option, which will unzip the archive you provide to a directory you specify:
PYSPARK_DRIVER_PYTHON=./.venv/bin/python \
spark-submit \
--archives site.zip#.venv \
src/app.py
The rather weird # syntax is used to specify an output directory, documented here.
Edit: there's also a tutorial on using venv-pack to achieve the same thing here, though what you're doing should already work.
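For completeness, a sketch of the venv-pack variant (venv-pack produces a relocatable tarball, which avoids the symlink and absolute-path problems that a plain zip of .venv can have; the paths mirror the question's layout):

```shell
# Inside the activated virtualenv for the project:
pip install venv-pack
venv-pack -o site.tar.gz          # pack the currently active virtualenv

# Ship the tarball; Spark unpacks it into ./.venv in each working directory.
PYSPARK_DRIVER_PYTHON=./.venv/bin/python \
spark-submit --archives site.tar.gz#.venv src/app.py
```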

Spark on windows 10 not working

I'm trying to get Spark working on Windows 10. When I try to run the Spark shell I get this error:
'Spark\spark-2.0.0-bin-hadoop2.7\bin..\jars""\ is not recognized as an internal or external command,operable program or batch file.
Failed to find Spark jars directory. You need to build Spark before running this program.
I am using a pre-built Spark for Hadoop 2.7 or later. I have installed Java 8, Eclipse Neon, Python 2.7, and Scala 2.11, and gotten winutils for Hadoop 2.7.1, and I still get this error.
When I downloaded Spark it came as a .tgz; when extracted, there was another .tgz inside, so I extracted that as well, and then I got all the bin folders and so on. I need to access spark-shell. Can anyone help?
EDIT:
The solution I ended up using:
1) VirtualBox
2) Linux Mint
I got the same error while setting up Spark. You can move the extracted folder to C:\.
Refer to this:
http://techgobi.blogspot.in/2016/08/configure-spark-on-windows-some-error.html
You are probably giving the wrong folder path to Spark bin.
Just open the command prompt and change directory to the bin inside the spark folder.
Type spark-shell to check.
Refer: Spark on win 10
"On Windows, I found that if it is installed in a directory that has a space in the path (C:\Program Files\Spark) the installation will fail. Move it to the root or another directory with no spaces."
OR
If you have installed Spark under “C:\Program Files (x86)..” replace 'Program Files (x86)' with Progra~2 in the PATH env variable and SPARK_HOME user variable.

What path do I use for pyspark?

I have Spark installed, and I can go into the bin folder within my Spark version, run ./spark-shell, and it works correctly.
But for some reason, I am unable to launch pyspark or any of its submodules.
So, I go into bin and launch ./pyspark, and it tells me that my path is incorrect.
The current path I have for PYSPARK_PYTHON is the same as where I'm running the pyspark executable script from.
What is the correct path for PYSPARK_PYTHON? Shouldn't it be the path that leads to the executable script called pyspark in the bin folder of the Spark version?
That's the path I have now, but it tells me env: <full PYSPARK_PYTHON path>: no such file or directory. Thanks.
What is the correct path for PYSPARK_PYTHON? Shouldn't it be the path that leads to the executable script called pyspark in the bin folder of the spark version?
No, it shouldn't. It should point to the Python executable you want to use with Spark (for example, the output of which python). If you don't want to use a custom interpreter, just leave it unset; Spark will use the first Python interpreter available on your system PATH.
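A minimal illustration (assuming python3 is on your PATH; the printed path will differ per machine):

```shell
# PYSPARK_PYTHON names the Python interpreter Spark should run,
# not the pyspark launcher script in Spark's bin folder.
export PYSPARK_PYTHON="$(which python3)"
echo "$PYSPARK_PYTHON"
# ./bin/pyspark          # would now start the shell with that interpreter
```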
