Spark execution error: spark-submit - apache-spark

I am trying to execute a program from spark-shell with the command below:
spark-submit --class com.aadharpoc.spark.UIDStats \
--packages com.databricks:spark-csv_2.10:1.5.0 \
--master yarn-client \
/home/cloudera/Desktop/aadhar_jar/Untitled.jar \
/home/cloudera/Desktop/UIDAI-ENR-DETAIL.csv
and the following error was prompted:
<console>:1: error: ';' expected but 'class' found.
spark-submit --class com.aadharpoc.spark.UIDStats \ --packages com.databricks:spark-csv_2.10:1.5.0 \ --master local[*] \ /home/cloudera/Desktop/aadhar_jar/Untitled.jar \ /home/cloudera/Desktop/UIDAI-ENR-DETAIL.csv
Thanks guys!!

You should not run spark-submit from the Scala REPL (spark-shell); you should run spark-submit from a normal Linux shell or terminal.
I hope this solves the issue.

spark-submit is a script used to submit a Spark program; it is available in the bin directory. It should be run from a terminal and not from spark-shell.
On Windows, if the PATH environment variable includes %SPARK_HOME%\bin, just open a command prompt and run spark-submit.
On Linux, set SPARK_HOME in your .bashrc (and add its bin directory to PATH); then you can run spark-submit from the terminal, or else provide the fully qualified path, e.g. .../bin/spark-submit.
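A minimal sketch of the Linux setup described above; the install path is a placeholder, not something taken from the question:
export SPARK_HOME=/opt/spark             # placeholder: wherever Spark is installed
export PATH=$SPARK_HOME/bin:$PATH        # makes spark-submit available from any terminal
After opening a new terminal (or running source ~/.bashrc), spark-submit can be invoked directly with the options shown in the question.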

Related

Running spark application with two different java versions

Is there any way I can run a Spark 3 application with two different Java versions, for example Java 8 and Java 11?
I have tried the following command, but it picks up the default Java version, i.e. Java 8.
spark-shell \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/ \
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/
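For reference, the same two properties can also be passed to spark-submit. The sketch below additionally exports JAVA_HOME in the launching shell so the locally started driver picks up Java 11 as well; it is only a sketch, assuming a YARN deployment (as the spark.yarn.* property implies), that the Java 11 path exists on every node, and that the jar and main class names are placeholders:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/
spark-submit \
--master yarn \
--conf spark.yarn.appMasterEnv.JAVA_HOME=$JAVA_HOME \
--conf spark.executorEnv.JAVA_HOME=$JAVA_HOME \
--class com.example.MyApp \
my-app.jar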

spark-submit passing python files in zip does not work

I am trying to submit a Python Application using spark-submit, like so:
spark-submit \
--conf spark.submit.pyFiles=path/to/archive.zip \
--conf spark.app.name=Test123 \
--conf spark.master=local[2] \
--conf spark.driver.memory=5G \
path/to/python_app.py
The python_app.py tries to import modules from archive.zip, but it fails with a ModuleNotFoundError. If I substitute
--conf spark.submit.pyFiles=path/to/archive.zip
with
--py-files path/to/archive.zip
it works as expected. It is really weird because setting master, driver memory and app name works using --conf.
What am I missing here? Thanks!
Edit 2018-07-06:
I tried this with Spark versions 2.1.3, 2.2.0 and 2.3.1 - the problem is the same for all three versions. And I have the problem regardless of whether I submit to local[x] or yarn.
I had the same problem recently. I believe the naming might be misleading here.
Setting spark.submit.pyFiles only states that you want to add the files to the PYTHONPATH. Apart from that, you also need to upload those files to the working directory of all your executors. You can do that with spark.files.
For me that does the job. I am setting those values in spark-defaults.conf.
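A minimal sketch of what those entries could look like in spark-defaults.conf, reusing the archive path from the question:
spark.submit.pyFiles   path/to/archive.zip
spark.files            path/to/archive.zip
The same pair can also be passed on the command line as two --conf options.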

Spark-submit error - Cannot load main class from JAR file

I am trying to run a job on Hadoop with Spark, but I get a "Cannot load main class from JAR file" error.
How can I fix this?
Try copying main.py and the additional Python files to a local file:// path instead of having them in HDFS.
You need to pass the additional Python files with the --py-files argument from a local directory as well.
Assuming you copy the python files to your working directory where you are launching spark-submit from, try the following command:
spark-submit \
--name "Final Project" \
--py-files police_reports.py,three_one_one.py,vehicle_volumn_count.py \
main.py

Is it possible to broadcast Tensorflow libraries using spark-submit -package

I am using a cluster that is not managed by myself. The TensorFlow libraries are not installed on any of the cluster nodes, but I would like to run some Spark programs that use the tensorflow package. I am not sure whether it is possible to simply use spark-submit --packages to broadcast the TensorFlow packages across the cluster nodes.
I am not sure about TensorFlow itself, but you can pass local jars using --jars and files using --files to the job. Below is an example:
spark-submit --master yarn-cluster --num-executors 5 --driver-memory 640m --executor-memory 640m --conf spark.yarn.maxAppAttempts=1000 \
--jars /usr/hdp/current/spark-client-1.6.1/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client-1.6.1/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client-1.6.1/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/hdp/current/spark-client-1.6.1/conf/hive-site.xml \
--class com.foobar.main
This is an example of how I start a Spark Streaming job where the Application Master and Driver run on a cluster where Spark is not installed, so I need to pass along some jars and configs for it to run.

Overwrite Spark version in dependency

In this presentation they show an example of "upgrading" the Spark version just by passing a newer spark-assembly.jar as a dependency. Here's the relevant snippet ("upgrading" from Spark 0.9 to 1.1):
export SPARK_JAR=/usr/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.2.0.jar
java -cp /etc/hadoop/conf:AppJar.jar:spark-assembly.jar org.apache.spark.deploy.yarn.Client --jar AppJar.jar --addJars /jars/config.jar --class ooyala.app.MainClass --arg arg1 --arg arg2 --name MyApp
This is a very nice possibility, since it allows using the latest features without having to upgrade the whole cluster very often. However, the code above is totally outdated now, so I tried to use something similar with spark-submit (trying to add the jar of Spark 1.5 to a cluster running Spark 1.2):
~/spark-1.5/bin/spark-submit \
--jars ~/spark-assembly-1.5.1-hadoop2.4.0.jar \
--class ooyala.app.MainClass \
--master yarn-client \
ooyala-test_2.10-1.0.jar
But it doesn't work either, resulting in a NullPointerException deep in Spark internals.
Does anyone have experience doing this trick on recent versions of Spark?
