I am using a cluster that I do not manage myself. The TensorFlow libraries are not installed on any of the cluster nodes, but I would like to run some Spark programs that use the tensorflow package. I am not sure whether it is possible to simply use spark-submit --packages to distribute the TensorFlow packages across the cluster nodes.
I am not sure about TensorFlow itself, but you can pass local jars using --jars and files using --files to the job. Below is an example:
spark-submit --master yarn-cluster --num-executors 5 --driver-memory 640m --executor-memory 640m --conf spark.yarn.maxAppAttempts=1000 \
--jars /usr/hdp/current/spark-client-1.6.1/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client-1.6.1/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client-1.6.1/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/hdp/current/spark-client-1.6.1/conf/hive-site.xml \
--class com.foobar.main
This is an example of how I start a Spark Streaming job where the Application Master and the driver run on a cluster on which Spark is not installed, so I need to pass along some jars and configs for it to run.
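Since --jars expects a single comma-separated list, it can be convenient to build that list from a directory instead of typing every path. The following is only a minimal sketch, assuming a POSIX shell; the helper name join_jars and the placeholder directory are hypothetical and not part of Spark:

```shell
# join_jars DIR: print every *.jar in DIR as one comma-separated string,
# which is the format spark-submit --jars expects. (Hypothetical helper.)
join_jars() {
    dir=$1
    list=""
    for jar in "$dir"/*.jar; do
        [ -e "$jar" ] || continue      # skip the unexpanded pattern when DIR has no jars
        list="${list:+$list,}$jar"     # insert a comma only between entries
    done
    printf '%s\n' "$list"
}

# Possible usage (placeholder paths, not run here):
#   spark-submit --jars "$(join_jars /path/to/libs)" --files /path/to/hive-site.xml ...
```

The shell expands the glob in sorted order, so the resulting list is stable between runs.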
Related
Is there any way I can run a Spark 3 application with two different Java versions, for example Java 8 and Java 11?
I have tried the following command, but it picks up the default Java version, i.e. Java 8.
spark-shell \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/ \
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/
I want to initialise PySpark 3.3.1 on AWS Cloud9 and read a file from an S3 path. But when I run the code, I get the error shown in the image.
I thought there was something wrong with my PySpark initialisation, so I tried the code below, provided by my colleague, but it doesn't work for me.
My PySpark version is 3.3.1 and my Hadoop version is 3.
pkg_list=org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hadoop:hadoop-aws:2.7.1
pyspark --packages $pkg_list --driver-memory 32G --driver-cores 8 --num-executors 8 --executor-memory 32G --executor-cores 8 --driver-java-options="-Djava.io.tmpdir=/home/yoongkiat/tempfiles"
The error is saying that some Hadoop config file or option that Spark is using contains the string 64M where only a number is expected.
The error doesn't say which file, and that's not a value you've provided on the command line, so you'll need to debug the installation on your own. As mentioned in comments, AWS EMR already offers a functional Spark environment.
By the way, you cannot mix dependencies from different Spark versions; you're running 3.3.1, but trying to add spark-avro for 2.4.4. I'm also not certain you'll need to add hadoop-aws, since Spark should have those libraries included out of the box.
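To illustrate the version point: the spark-avro artifact name carries the Scala binary version, and the artifact version should match the running Spark release. The following is a minimal sketch, assuming Spark 3.3.1 built against Scala 2.12 (confirm both with spark-submit --version on your own install); the helper name avro_coordinate is hypothetical:

```shell
# avro_coordinate SPARK_VERSION SCALA_BINARY_VERSION
# Print the Maven coordinate of the spark-avro build for that combination.
# (Hypothetical helper; the versions passed below are assumptions.)
avro_coordinate() {
    spark_version=$1
    scala_version=$2
    printf 'org.apache.spark:spark-avro_%s:%s\n' "$scala_version" "$spark_version"
}

pkg_list=$(avro_coordinate 3.3.1 2.12)
# Print the command instead of running it, so it can be checked first.
echo "pyspark --packages $pkg_list"
```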
I am trying to submit a Python Application using spark-submit, like so:
spark-submit \
--conf spark.submit.pyFiles=path/to/archive.zip \
--conf spark.app.name=Test123 \
--conf spark.master=local[2] \
--conf spark.driver.memory=5G \
path/to/python_app.py
The python_app.py tries to import modules from archive.zip, but it fails with a ModuleNotFoundError. If I substitute
--conf spark.submit.pyFiles=path/to/archive.zip
with
--py-files path/to/archive.zip
it works as expected. It is really weird because setting master, driver memory and app name works using --conf.
What am I missing here? Thanks!
Edit 2018-07-06:
I tried this with Spark versions 2.1.3, 2.2.0 and 2.3.1 - the problem is the same for all three versions. And: I have the problem regardless of submitting to local[x] or yarn.
I had the same problem recently. I believe naming might be misleading here.
Setting spark.submit.pyFiles only states that you want to add those files to PYTHONPATH. Apart from that, you need to upload the files to the working directory of all your executors. You can do that with spark.files.
For me it does the job. I am setting those values in spark-defaults.conf.
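As a rough sketch of what this answer describes (not verified against a particular Spark version), the two properties could be written as spark-defaults.conf entries, where each line is a property name followed by whitespace and its value. The helper name pyfiles_defaults is hypothetical:

```shell
# pyfiles_defaults ARCHIVE: print spark-defaults.conf lines that point both
# spark.submit.pyFiles and spark.files at the same archive, as the answer suggests.
# (Hypothetical helper for illustration.)
pyfiles_defaults() {
    archive=$1
    printf 'spark.submit.pyFiles %s\nspark.files %s\n' "$archive" "$archive"
}

# Possible usage (appends to the config file, not run here):
#   pyfiles_defaults path/to/archive.zip >> "$SPARK_HOME/conf/spark-defaults.conf"
pyfiles_defaults path/to/archive.zip
```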
I am trying to execute a program from spark-shell with the below command
spark-submit --class com.aadharpoc.spark.UIDStats \
--packages com.databricks:spark-csv_2.10:1.5.0 \
--master yarn-client \
/home/cloudera/Desktop/aadhar_jar/Untitled.jar \
/home/cloudera/Desktop/UIDAI-ENR-DETAIL.csv
The following error was displayed:
<console>:1: error: ';' expected but 'class' found.
spark-submit --class com.aadharpoc.spark.UIDStats \ --packages com.databricks:spark-csv_2.10:1.5.0 \ --master local[*] \ /home/cloudera/Desktop/aadhar_jar/Untitled.jar \ /home/cloudera/Desktop/UIDAI-ENR-DETAIL.csv
Thanks guys!!
You should not run spark-submit from the Scala REPL or spark-shell. You should run spark-submit from a normal Linux shell or terminal.
I hope this solves the issue.
spark-submit is a script used to submit a Spark program, and it is available in the bin directory. It should be run from a terminal, not from spark-shell.
On Windows, if the PATH environment variable includes %SPARK_HOME%/bin, just open a command prompt and run spark-submit.
On Linux, SPARK_HOME must be set in your .bashrc (with its bin directory on PATH); then you can run spark-submit from the terminal. Otherwise, provide the fully qualified path, like ..../spark-submit
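The lookup described above can be sketched as a small helper for a Linux terminal: use spark-submit from PATH if it is there, and otherwise fall back to the copy under SPARK_HOME. This is only an illustration, assuming a POSIX shell; the function name find_spark_submit is hypothetical:

```shell
# find_spark_submit: print the path of the spark-submit script to use.
# Prefers whatever is on PATH, then falls back to $SPARK_HOME/bin.
# (Hypothetical helper; run it in a normal terminal, not inside spark-shell.)
find_spark_submit() {
    if command -v spark-submit >/dev/null 2>&1; then
        command -v spark-submit
    elif [ -n "${SPARK_HOME:-}" ] && [ -x "$SPARK_HOME/bin/spark-submit" ]; then
        printf '%s\n' "$SPARK_HOME/bin/spark-submit"
    else
        echo "spark-submit not found; set SPARK_HOME or extend PATH" >&2
        return 1
    fi
}

# Possible usage (not run here):
#   "$(find_spark_submit)" --class com.aadharpoc.spark.UIDStats ...
```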
In this presentation they show an example of "upgrading" the Spark version just by passing a newer spark-assembly.jar as a dependency. Here's the relevant snippet ("upgrading" from Spark 0.9 to 1.1):
export SPARK_JAR=/usr/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.2.0.jar
java -cp /etc/hadoop/conf:AppJar.jar:spark-assembly.jar org.apache.spark.deploy.yarn.Client --jar AppJar.jar --addJars /jars/config.jar --class ooyala.app.MainClass --arg arg1 --arg arg2 --name MyApp
This is a very nice possibility, since it allows using the latest features without having to upgrade the whole cluster very often. However, the code above is totally outdated now, so I tried something similar with spark-submit (trying to add the jar of Spark 1.5 to a cluster running Spark 1.2):
~/spark-1.5/bin/spark-submit \
--jars ~/spark-assembly-1.5.1-hadoop2.4.0.jar \
--class ooyala.app.MainClass \
--master yarn-client \
ooyala-test_2.10-1.0.jar
But it doesn't work either, resulting in NullPointerException deep in Spark internals.
Does anyone have experience doing this trick on recent versions of Spark?