Spark Standalone and Virtual Environments - apache-spark

We have a Spark cluster running in standalone mode, and we are trying to configure spark-submit jobs to use virtual environments managed by pipenv.
The project has this structure:
project/
|-- .venv/
|   |-- bin/python
|   |-- lib/python3.6/site-packages
|-- src/
|   |-- app.py
The current attempt involves zipping the virtual environment (zip -r site.zip .venv), so that it includes the Python executable and all site packages, and shipping that along to the executors.
The spark-submit command is currently:
PYSPARK_DRIVER_PYTHON=./.venv/bin/python \
spark-submit --py-files site.zip src/app.py
Our thinking was that --py-files would unzip site.zip into the working directory on each executor, reproducing .venv so that .venv/bin/python and the site-packages would be available on the Python path. This is clearly not the case, as we are receiving the error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task
0.3 in stage 0.0 (TID 3, [executor-node-uri], executor 0):
java.io.IOException: Cannot run program "./.venv/bin/python":
error=2, No such file or directory
My question is: is our understanding of --py-files correct? I tried browsing the Spark source code, but could not follow the flow of the --py-files argument in the case where it is a zip file. There are a number of tutorials for YARN mode and for shipping conda environments with spark-submit, but not much on Spark standalone. Is this even possible?
Addendum: these are the YARN tutorials I was learning from:
https://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/
http://quasiben.github.io/blog/2016/4/15/conda-spark/

The --py-files option will not unpack a zip file you provide it. The reason Python can use packages in a zip file is that Python supports importing from zips directly. However, if the Python binary itself is packaged that way, Spark will not be able to locate it.
To achieve this you should instead use the (terribly documented) --archives option, which will unzip the archive you provide into a directory you specify:
PYSPARK_DRIVER_PYTHON=./.venv/bin/python \
spark-submit \
--archives site.zip#.venv \
src/app.py
The rather weird # syntax is used to specify an output directory, documented here.
Edit: there's also a tutorial on using venv-pack to achieve the same thing here, though what you're doing should already work.
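For reference, a fuller submission on standalone might look like the sketch below, assuming client mode (so the driver can use the local .venv directly) and a Spark version whose --archives handling unpacks the archive into each executor's working directory; PYSPARK_PYTHON is the standard variable for pointing the executor Python workers at the unpacked interpreter:
# Package the environment from the project root (paths are illustrative)
zip -r site.zip .venv
# Driver uses the local environment; executors use the copy that
# Spark unpacks into ./.venv in their working directory
PYSPARK_DRIVER_PYTHON=.venv/bin/python \
PYSPARK_PYTHON=./.venv/bin/python \
spark-submit \
--archives site.zip#.venv \
src/app.py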

Related

PySpark virtual environment archive on S3

I'm trying to deploy PySpark applications, each with its own set of third-party dependencies, to an EMR cluster, and I am following this blog post, which describes a few approaches to packaging a virtual environment and distributing it across the cluster.
So, I've made a virtual environment with virtualenv, used venv-pack to create a tarball of the virtual environment, and I'm trying to pass that as an --archives argument to spark-submit:
spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.pyspark.python=./venv/bin/python \
--archives s3://path/to/my/venv.tar.gz#venv \
s3://path/to/my/main.py
This fails with Cannot run program "./venv/bin/python": error=2, No such file or directory. Without the spark.pyspark.python option, the job fails with import errors, so my question is mainly about the syntax of this command when the archive is a remote object.
I can run a job that has no extra dependencies and a main method that's located remote to the cluster in S3, so I know at least something on S3 can be referenced (much like a Spark application JAR, which I'm much more familiar with). The problem is the virtual environment. I've found much literature about this, but it's all where the virtual environment archive is physically on the cluster. For many reasons, I would like to avoid having to copy virtual environments to the cluster every time a developer makes a new application.
Can I reference a remote archive? If so, what is the syntax for this and what other configuration options might I need?
I don't think it should matter, but just in case: I'm using a Livy client to submit this job remotely (its equivalent of the spark-submit call above).
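For reference, the packaging and staging steps described above look roughly like the following sketch (the bucket paths and the requirements file are placeholders):
# Build and pack the virtual environment locally
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
venv-pack -o venv.tar.gz
# Stage the archive and the application in S3
aws s3 cp venv.tar.gz s3://path/to/my/venv.tar.gz
aws s3 cp main.py s3://path/to/my/main.py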

Running Python app on Spark with Conda dependencies

I am trying to run a Python script in Spark. I am running Spark in client mode (i.e. single node) with a Python script that has some dependencies (e.g. pandas) installed via Conda. There are various resources which cover this usage case, for example:
https://conda.github.io/conda-pack/spark.html
https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
Using those as an example I run Spark via the following command in the Spark bin directory, where /tmp/env.tar is the Conda environment packed by conda-pack:
export PYSPARK_PYTHON=./environment/bin/python
./spark-submit --archives=/tmp/env.tar#environment script.py
Spark throws the following exception:
java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
Why does this not work? I am also curious about the ./ in the Python path, as it's not clear where Spark unpacks the tar file. I assumed I did not need to load the tar file into HDFS since this is all running on a single node (but perhaps I do for cluster mode?).
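For reference, the conda-pack step mentioned above looks roughly like this sketch (the environment name and package list are illustrative):
# Create and pack the Conda environment
# (requires conda-pack: conda install -c conda-forge conda-pack)
conda create -y -n pyspark_env python=3.8 pandas
conda pack -n pyspark_env -o /tmp/env.tar
# Then submit as above, pointing the Python workers at the
# directory the archive is unpacked into
export PYSPARK_PYTHON=./environment/bin/python
./spark-submit --archives=/tmp/env.tar#environment script.py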

Is there a way to use PySpark with Hadoop 2.8+?

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.
PySpark versions seem to be aligned with Spark versions.
Here I use PySpark 2.4.5, which seems to wrap Spark 2.4.5.
When submitting my PySpark job with spark-submit --master local[4] ... and the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql
With the following java exceptions:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
Or:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I suppose that the Hadoop version used by my PySpark job is not aligned with the one I pass via the spark-submit option spark.jars.packages.
But I have no idea how I could make it work. :)
The default Spark distro has Hadoop libraries included, and Spark uses its own (system) libraries first. So you should either set --conf spark.driver.userClassPathFirst=true (and, for a cluster, also add --conf spark.executor.userClassPathFirst=true), or download a Spark distro without Hadoop. In that case you will probably have to put your Hadoop distro's jars into the Spark jars directory.
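As a sketch, the first variant might look like this (the script name is a placeholder):
spark-submit \
--master local[4] \
--conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5 \
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
my_job.py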
Ok, I found a solution:
1 - Install Hadoop in the expected version (2.8.5 for me)
2 - Install a Hadoop Free version of Spark (2.4.4 for me)
3 - Set the SPARK_DIST_CLASSPATH environment variable, to make Spark use the custom version of Hadoop.
(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
4 - Add the PySpark directories to the PYTHONPATH environment variable, like the following:
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
(Note that the py4j version may differ.)
That's it.
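A sketch of steps 3 and 4 as shell commands, following the hadoop-provided docs linked above (the installation paths are placeholders):
# Point Spark at the separately installed Hadoop 2.8.5 libraries
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop-2.8.5/bin/hadoop classpath)
# Make the PySpark and Py4J sources importable
export SPARK_HOME=/path/to/spark-2.4.4-bin-without-hadoop
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH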

Spark-submit error - Cannot load main class from JAR file

I am trying to run Spark on Hadoop, but I get a "Cannot load main class from JAR file" error.
How can I fix this?
Try copying main.py and the additional python files to a local file:// path instead of having them in HDFS.
You need to pass the additional python files with the --py-files argument from a local directory as well.
Assuming you copy the python files to your working directory where you are launching spark-submit from, try the following command:
spark-submit \
--name "Final Project" \
--py-files police_reports.py,three_one_one.py,vehicle_volumn_count.py \
main.py
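If the files currently live in HDFS, one way to get local copies first is a sketch like the following (the HDFS directory is a placeholder):
# Copy the application and its helper modules out of HDFS
# into the current working directory
hdfs dfs -get /user/me/project/main.py .
hdfs dfs -get /user/me/project/police_reports.py .
hdfs dfs -get /user/me/project/three_one_one.py .
hdfs dfs -get /user/me/project/vehicle_volumn_count.py .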

How to launch programs in Apache Spark?

I have a “myprogram.py” and a “myprogram.scala” that I need to run on my Spark machine. How can I upload and launch them?
I have been using the shell to do my transformations and call actions, but now I want to launch a complete program on the Spark machine instead of entering single commands every time. I also believe that will make it easier to change my program than re-entering commands in the shell.
I did a standalone installation on Ubuntu 14.04, on a single machine (not a cluster), using Spark 1.4.1.
I went through the Spark docs online, but I only found instructions on how to do that on a cluster. Please help me with that.
Thank you.
The documentation to do this (as commented above) is available here: http://spark.apache.org/docs/latest/submitting-applications.html
However, the code you need is here:
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
You'll need to compile the Scala file using sbt (documentation here: http://www.scala-sbt.org/0.13/tutorial/index.html).
Here's some information on the build.sbt file you'll need in order to grab the right dependencies: http://spark.apache.org/docs/latest/quick-start.html
Once the Scala file is compiled, you'll submit the resulting jar using the command above.
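For the Scala program, a minimal sketch of that step might be (the project name, version, and main class are assumptions about your build):
# From the sbt project root: compile and package the Scala program
sbt package
# Submit the resulting jar locally, naming your program's main class
./bin/spark-submit \
--class MyProgram \
--master local[8] \
target/scala-2.10/myprogram_2.10-1.0.jar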
To put it simply:
In a Linux terminal, cd to the directory where Spark is unpacked/installed.
Note that this folder normally contains subfolders like “bin”, “conf”, “lib”, “logs” and so on.
To run the Python program locally with simple/default settings, type the command:
./bin/spark-submit --master local[*] myprogram.py
More complete descriptions are given above, as zero323 and ApolloFortyNine described.
