Spark-submit error - Cannot load main class from JAR file - apache-spark

I am trying to run a Spark job on Hadoop, but I get a "Cannot load main class from JAR file" error.
How can I fix this?

Try copying main.py and the additional Python files to a local file:// path instead of keeping them in HDFS.
You need to pass the additional Python files with the --py-files argument from a local directory as well.
Assuming you copy the Python files to the working directory you are launching spark-submit from, try the following command:
spark-submit \
--name "Final Project" \
--py-files police_reports.py,three_one_one.py,vehicle_volumn_count.py \
main.py
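With the files shipped via --py-files, main.py can then import them as ordinary modules. A minimal sketch of what that might look like, assuming the module names match the files above (the helper call is a purely hypothetical placeholder):
# main.py - minimal sketch; the imports correspond to the files passed via --py-files
from pyspark.sql import SparkSession

import police_reports
import three_one_one
import vehicle_volumn_count

spark = SparkSession.builder.appName("Final Project").getOrCreate()

# ... call into the helper modules here, e.g. police_reports.run(spark)
# (hypothetical function name) ...

spark.stop()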

Related

How to add jar files to $SPARK_HOME/jars correctly?

I have used this command and it works fine:
spark = SparkSession.builder.appName('Apptest')\
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.5').getOrCreate()
But I'd like to download the jar file and always start with:
spark = SparkSession.builder.appName('Apptest').getOrCreate()
How can I do it? I have tried:
Move to the SPARK_HOME jars dir:
cd /de/spark-2.4.6-bin-hadoop2.7/jars
Download the jar file:
curl https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.3.5/mongo-spark-connector_2.11-2.3.5.jar --output mongo-spark-connector_2.11-2.3.5.jar
But Spark doesn't see it. I get the following error:
Py4JJavaError: An error occurred while calling o66.save.
: java.lang.NoClassDefFoundError: com/mongodb/ConnectionString
I know there is the ./spark-shell --jars option, but I am using a Jupyter notebook. Is there some step missing?
Since you're using SparkSession in a Jupyter notebook, unfortunately you have to use .config('spark.jars.packages', '...') to add the jars you want when you're creating the Spark object.
If you instead want the jar added by default whenever you launch a notebook, I would recommend creating a custom kernel, so that every time you open a new notebook you don't even need to create the Spark session yourself. If you're using Anaconda, you can check the docs: https://docs.anaconda.com/ae-notebooks/admin-guide/install/config/custom-pyspark-kernel/
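For reference, a minimal sketch of the kind of setup such a kernel would provide (or that you could run in a cell before the session is created), assuming the same package coordinate as above; a custom kernel typically just sets this environment variable for you:
import os
from pyspark.sql import SparkSession

# Must be set before the first SparkSession/SparkContext is created; it has the
# same effect as passing --packages to spark-submit.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.5 pyspark-shell"
)

spark = SparkSession.builder.appName('Apptest').getOrCreate()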
What I was looking for is .config("spark.jars",".."):
spark = SparkSession.builder.appName('Test')\
.config("spark.jars", "/root/mongo-spark-connector_2.11-2.3.5.jar,/root/mongo-java-driver-3.12.5.jar") \
.getOrCreate()
Or:
import os
os.environ["PYSPARK_SUBMIT_ARGS"]="--jars /root/mongo-spark-connector_2.11-2.3.5.jar,/root/mongo-java-driver-3.12.5.jar pyspark-shell"
Also, it seems that simply putting the jar files in $SPARK_HOME/jars works fine as well, but in my case the example from my question was missing the dependency mongo-java-driver-3.12.5.jar. After downloading all the dependencies into $SPARK_HOME/jars, I was able to run with only:
spark = SparkSession.builder.appName('Test').getOrCreate()
I found the dependencies at: https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector_2.11/2.3.5
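As a quick sanity check that the jars in $SPARK_HOME/jars are actually picked up, a read along these lines should work; the URI, database, and collection here are hypothetical placeholders, and "mongo" is the short format name registered by the 2.x connector:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()

# Hypothetical connection string - replace with your own MongoDB URI, database, and collection.
df = (spark.read.format("mongo")
      .option("uri", "mongodb://localhost:27017/testdb.testcollection")
      .load())
df.printSchema()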

Running Python app on Spark with Conda dependencies

I am trying to run a Python script in Spark. I am running Spark in client mode (i.e. single node) with a Python script that has some dependencies (e.g. pandas) installed via Conda. There are various resources which cover this usage case, for example:
https://conda.github.io/conda-pack/spark.html
https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
Using those as an example I run Spark via the following command in the Spark bin directory, where /tmp/env.tar is the Conda environment packed by conda-pack:
export PYSPARK_PYTHON=./environment/bin/python
./spark-submit --archives=/tmp/env.tar#environment script.py
Spark throws the following exception:
java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
Why does this not work? I am also curious about the ./ in the Python path, as it's not clear where Spark unpacks the tar file. I assumed I did not need to load the tar file into HDFS since this is all running on a single node (but perhaps I do for cluster mode?).

Spark Standalone and Virtual Environments

With a spark cluster configured as spark-standalone, we are trying to configure spark-submit jobs to utilize virtual environments managed by pipenv.
The project has this structure:
project/
|-- .venv/
|   |-- bin/python
|   |-- lib/python3.6/site-packages
|-- src/
|   |-- app.py
The current attempt involves zipping the virtual environment (zip -r site.zip .venv) to include the Python executable and all site packages, and shipping that along to the executors.
The spark-submit command is currently:
PYSPARK_DRIVER_PYTHON=./.venv/bin/python \
spark-submit --py-files site.zip src/app.py
The thinking is that the --py-files argument should unzip site.zip into the working directory on the executors, reproducing .venv with .venv/bin/python and the site packages available on the Python path. This is clearly not the case, as we are receiving the error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task
0.3 in stage 0.0 (TID 3, [executor-node-uri], executor 0):
java.io.IOException: Cannot run program "./.venv/bin/python":
error=2, No such file or directory
My question is: is our understanding of --py-files correct? I tried browsing the Spark source code, but could not follow the flow of the --py-files argument in the case where it is a zip file. There are a number of tutorials on YARN mode and shipping Conda environments with spark-submit, but not much on Spark standalone; is this even possible?
Addendum: These are the YARN tutorials I was learning from:
https://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/
http://quasiben.github.io/blog/2016/4/15/conda-spark/
The --py-files option will not unpack a zip file you provide it. The reason Python can use packages inside a zip file is that Python supports importing from zips directly. However, if the Python binary itself is packaged that way, Spark will not be able to locate it.
To achieve this, you should instead use the (terribly documented) --archives option, which will unzip the archive you provide into a directory you specify:
PYSPARK_DRIVER_PYTHON=./.venv/bin/python \
spark-submit \
--archives site.zip#.venv \
src/app.py
The rather weird # syntax is used to specify the output directory, as documented here.
Edit: there's also a tutorial on using venv-pack to achieve the same thing here, though what you're doing should already work.
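The same idea can also be expressed as Spark configuration rather than command-line flags. A sketch under stated assumptions: spark.archives exists only in Spark 3.1+, the archive was built so its internal paths are relocatable (e.g. with venv-pack), and you should verify that your cluster manager honours these settings:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("venv-archive-sketch")
         # Unpack site.zip into each executor's working directory as ".venv".
         .config("spark.archives", "site.zip#.venv")
         # Point the executors' Python at the interpreter inside the unpacked archive;
         # the driver's interpreter is still usually chosen via PYSPARK_DRIVER_PYTHON
         # as in the command above.
         .config("spark.pyspark.python", "./.venv/bin/python")
         .getOrCreate())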

How to deploy war file in spark-submit command (spark)

I am using
spark-submit --class main.Main --master local[2] /user/sampledata/parser-0.0.1-SNAPSHOT.jar
to run Java Spark code. Is it possible to run this code using a war file instead of a jar, since I am looking to deploy it on Tomcat?
I tried a war file, but it gives a class-not-found exception.

How to launch programs in Apache Spark?

I have a “myprogram.py” and a “myprogram.scala” that I need to run on my Spark machine. How can I upload and launch them?
I have been using the shell to do my transformations and call actions, but now I want to launch a complete program on the Spark machine instead of entering single commands every time. I also believe that will make it easier to change my program instead of re-entering commands in the shell each time.
I did a standalone installation on Ubuntu 14.04, on a single machine, not a cluster, using Spark 1.4.1.
I went through the Spark docs online, but I only found instructions on how to do this on a cluster. Please help me with this.
Thank you.
The documentation to do this (as commented above) is available here: http://spark.apache.org/docs/latest/submitting-applications.html
However, the code you need is here:
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
You'll need to compile the Scala file using sbt (documentation here: http://www.scala-sbt.org/0.13/tutorial/index.html).
Here's some information on the build.sbt file you'll need in order to grab the right dependencies: http://spark.apache.org/docs/latest/quick-start.html
Once the Scala file is compiled, you'll submit the resulting jar using the command above.
To put it simply:
In a Linux terminal, cd to the directory where Spark is unpacked/installed.
Note that this folder normally contains subfolders like “bin”, “conf”, “lib”, “logs” and so on.
To run the Python program locally with simple/default settings, type the command:
./bin/spark-submit --master local[*] myprogram.py
More complete descriptions are available, as zero323 and ApolloFortyNine described above.
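One difference from the interactive shell worth noting: a submitted script has to create (and stop) its own context, since sc is not pre-created for it. A minimal sketch of what myprogram.py could look like on Spark 1.4.1, where the app name and the toy computation are just placeholders:
# myprogram.py - minimal sketch; unlike the shell, a submitted script creates its own SparkContext.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("myprogram")
sc = SparkContext(conf=conf)

# Placeholder work - replace with your own transformations and actions.
rdd = sc.parallelize(range(10))
print(rdd.sum())

sc.stop()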
