Spark-submit in cluster mode - apache-spark

I'm facing a problem launching a Spark application in cluster mode.
This is the .sh script:
export SPARK_MAJOR_VERSION=2
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 8G \
--executor-memory 8G \
--total-executor-cores 4 \
--num-executors 4 \
/home/hdfs/spark_scripts/ETL.py &> /home/hdfs/spark_scripts/log_spark.txt
In the YARN logs, I found an ImportError related to a .py file that "ETL.py" needs. In other words, in "ETL.py" I've got a line in which I do this:
import AppUtility
AppUtility.py is in the same path as ETL.py.
In local mode, it works.
This is the YARN log:
20/04/28 10:59:59 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Container: container_e64_1584554814241_22431_02_000001 on ftpandbit02.carte.local_45454
LogAggregationType: AGGREGATED
LogType:stdout
LogLastModifiedTime:Tue Apr 28 10:57:10 +0200 2020
LogLength:138
LogContents:
Traceback (most recent call last):
File "ETL.py", line 8, in
import AppUtility
ImportError: No module named AppUtility
End of LogType:stdout
End of LogType:prelaunch.err

It depends on whether you run in client mode or cluster mode.
If you use Spark in YARN client mode,
you'll need to install any dependencies on the machines on which YARN starts the executors. That's the only surefire way to make this work.
Using Spark in YARN cluster mode is a different story. You can distribute Python dependencies with --py-files:
./bin/spark-submit --py-files AppUtility.py \
/home/hdfs/spark_scripts/ETL.py
The --py-files directive sends the file to the Spark workers but does not add it to the PYTHONPATH.
To add the dependency to the PYTHONPATH and fix the ImportError, add the following line to the Spark job, ETL.py:
sc.addPyFile(PATH)
PATH: AppUtility.py (it can be a local file, a file in HDFS, a .zip, or an HTTP, HTTPS or FTP URI)
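For example, near the top of ETL.py this could look like the following minimal sketch (the HDFS location is only an illustration, not a path from the question; a local path works as well):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# Ship AppUtility.py to the driver and executors and add it to their PYTHONPATH.
# Hypothetical HDFS location; a local, zip, HTTP, HTTPS or FTP URI also works.
sc.addPyFile("hdfs:///user/hdfs/spark_scripts/AppUtility.py")

import AppUtility  # import only after addPyFile has run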

Related

FileNotFound error when running spark-submit

I am trying to run the spark-submit command on my Hadoop cluster
Here is a summary of my Hadoop Cluster:
The cluster is built using 5 VirtualBox VMs connected on an internal network
There are 1 namenode and 4 datanodes.
All the VMs were built from the Bitnami Hadoop Stack VirtualBox image
When I run the following command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I receive the following error:
java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist
I also get a similar error when trying to create a sparkSession using PySpark:
spark = SparkSession.builder.appName('appName').getOrCreate()
I have tried/verified the following:
environment variables: HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
Added spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ in spark-defaults.conf
I believe spark.yarn.stagingDir needs to be an HDFS path.
More specifically, the "YARN Staging directory" needs to be available on all Spark executors, not just a local file path from where you run spark-submit
The path that isn't found is being reported from the YARN cluster, where /home/bitnami might not exist, or the Unix user running the Spark executor containers does not have access to that path.
Similarly, spark.yarn.jars (or spark.yarn.archive) should be HDFS paths because these will get downloaded, in parallel, across all executors.
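For the PySpark session creation mentioned in the question, a minimal sketch could look like the following; the HDFS locations below are assumptions, not paths taken from the question, and the same values can equally go into spark-defaults.conf:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("appName")
    .master("yarn")
    # staging dir on HDFS so every node, not just the submitting machine, can read it
    .config("spark.yarn.stagingDir", "hdfs:///user/bitnami/sparkStaging")
    # Spark jars published to HDFS (hypothetical location)
    .config("spark.yarn.jars", "hdfs:///spark/jars/*.jar")
    .getOrCreate()
)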
Since the Spark job is supposed to be submitted to the Hadoop cluster managed by YARN, master and deploy-mode have to be set. From the Spark 3.3.0 docs:
# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
Or programmatically:
spark = SparkSession.builder.appName('appName').master("yarn").config("spark.submit.deployMode","cluster").getOrCreate()

Pyspark yarn cluster submit error (Cannot run Program Python)

I am trying to submit PySpark code with a pandas UDF (to use fbprophet...).
It works well when submitted locally but fails when submitted to the cluster with an error such as:
Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 41, ip-172-31-11-94.ap-northeast-2.compute.internal, executor 2): java.io.IOException: Cannot run program
"/mnt/yarn/usercache/hadoop/appcache/application_1620263926111_0229/container_1620263926111_0229_01_000001/environment/bin/python": error=2, No such file or directory
My spark-submit command:
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives ./environment.tar.gz#environment \
--files config.ini \
$1
I made environment.tar.gz with conda pack, dependencies.zip contains my local packages, and
config.ini is used to load settings.
Is there any way to handle this problem?
You can't use a local path:
--archives ./environment.tar.gz#environment
Publish environment.tar.gz to HDFS:
venv-pack -o environment.tar.gz
# or conda pack
hdfs dfs -put -f environment.tar.gz /spark/app_name/
hdfs dfs -chmod 0664 /spark/app_name/environment.tar.gz
And change the --archives argument of spark-submit:
--archives hdfs:///spark/app_name/environment.tar.gz#environment
More info: PySpark on YARN in self-contained environments

Spark's example throws FileNotFoundException in client mode

I have: Ubuntu 14.04, Hadoop 2.7.7, Spark 2.2.0.
I just installed everything.
When I try to run Spark's example:
bin/spark-submit --deploy-mode client \
--class org.apache.spark.examples.SparkPi \
examples/jars/spark-examples_2.11-2.2.0.jar 10
I get the following error:
INFO yarn.Client:
client token: N/A
diagnostics: Application application_1552490646290_0007 failed 2 times due to AM Container for
appattempt_1552490646290_0007_000002 exited with exitCode: -1000 For
more detailed output, check application tracking
page:http://ip-123-45-67-89:8088/cluster/app/application_1552490646290_0007 Then,
click on links to logs of each attempt. Diagnostics: File
file:/tmp/spark-f5879f52-6777-481a-8ecf-bbb55e376901/__spark_libs__6948713644593068670.zip
does not exist java.io.FileNotFoundException: File
file:/tmp/spark-f5879f52-6777-481a-8ecf-bbb55e376901/__spark_libs__6948713644593068670.zip
does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:421)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
I get the same error both in client mode and cluster mode.
It seems that it fails while loading the Spark libs. As Daniel points out, it could be related to your read rights. It could also be related to running out of disk space.
However, in our case, to avoid transfer latencies to the master and read/write-rights issues on the local machine, we put the Spark libs in the HDFS of the YARN cluster and then point to them with the spark.yarn.archive property:
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
export HADOOP_USER_NAME=hadoop
hadoop fs -mkdir -p /apps/spark/
hadoop fs -put -f ${SPARK_HOME}/spark-libs.jar /apps/spark/
# spark-defaults.conf
spark.yarn.archive hdfs:///apps/spark/spark-libs.jar
First, provide the path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
Second, if you run in YARN mode you have to point your master to yarn (see submitting-applications) and put your jar file in HDFS:
# Run on a YARN cluster
# Connect to a YARN cluster in client or cluster mode depending on the value
# of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR
# or YARN_CONF_DIR variable.
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \ # can be client for client mode
hdfs://path/to/spark-examples.jar
1000

Spark YARN on EMR - JavaSparkContext - IllegalStateException: Library directory does not exist

I have a Java Spark job that works on a manually deployed Spark 1.6.0 in standalone mode on EC2.
I am spark-submitting this job to an EMR 5.3.0 cluster on the master using YARN, but it fails.
The spark-submit line is:
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
The "yarn-client" is the first argument to the x.jar application and is fed to the SparkContext as setMaster,
conf.setMaster(args[0]);
When I submit it, it starts out running fine, until I initialize the JavaSparkContext from a SparkConf,
JavaSparkContext sc = new JavaSparkContext(conf);
... and then Spark crashes.
In the YARN log, I can see the following,
yarn logs -applicationId application_1487325147456_0051
...
17/02/17 16:27:13 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/17 16:27:13 INFO Client: Deleted staging directory hdfs://ip-172-31-8-237.eu-west-1.compute.internal:8020/user/ec2-user/.sparkStaging/application_1487325147456_0052
17/02/17 16:27:13 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Library directory '/mnt/yarn/usercache/ec2-user/appcache/application_1487325147456_0051/container_1487325147456_0051_01_000001/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
...
Noting the WARN of spark.yarn.jars flag missing, I found a spark yarn JAR file in
/usr/lib/spark/jars/
... and uploaded it to HDFS per Cloudera's guide on how to run YARN applications on Spark and tried to add that conf, so this became my spark-submit line,
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --conf spark.yarn.jars=hdfs://`hostname -f`:8020/sparkyarnlibs/spark-yarn_2.11-2.1.0.jar --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
But that did not work and gave this:
Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I am really puzzled as to what is causing that Library error and how to proceed from here.
You have specified "--deploy-mode cluster" and yet are calling conf.setMaster("yarn-client") from the code. Using a master URL of "yarn-client" means "use YARN as the master, and use client mode (not cluster mode)", so I wouldn't be surprised if this is somehow confusing Spark because on one hand you're telling it to use cluster mode and on the other you're telling it to use client mode.
By the way, using a master URL like "yarn-client" or "yarn-cluster" is actually deprecated because the "-client" or "-cluster" part is not really part of the Master but rather is the deploy mode. That is, "--master yarn-client" is really more of a shortcut/alias for "--master yarn --deploy-mode client", and similarly "--master yarn-cluster" just means "--master yarn --deploy-mode cluster".
My recommendation would be to not call conf.setMaster() from your code, since the master is already set to "yarn" automatically in /etc/spark/conf/spark-defaults.conf. For this reason, you also don't need to pass "--master yarn" to spark-submit.
Lastly, it sounds like you need to decide whether you really want to use client deploy mode or cluster deploy mode. With client deploy mode, the driver runs on the master instance, and with cluster deploy mode, the driver runs in a YARN container on one of the core/task instances. See https://spark.apache.org/docs/latest/running-on-yarn.html for more information.
If you want to use client deploy mode, you don't need to pass anything extra because it's already the default. If you want to use cluster deploy mode, pass "--deploy-mode cluster".

Error: spark-assembly-1.4.1-hadoop2.6.0.jar does not exist

I'm trying to submit a Spark app from my local machine's terminal to my cluster. I'm using --master yarn-cluster. I need to run the driver program on my cluster too, not on the machine from which I submit the application, i.e. my local machine.
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
and getting the error:
Diagnostics: java.io.FileNotFoundException: File
file:/Users/nish1013/Dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar
does not exist
I can see in my service list,
YARN + MapReduce2 2.7.1.2.3 Apache Hadoop NextGen MapReduce (YARN)
Spark 1.4.1.2.3 Apache Spark is a fast and general engine for
large-scale data processing.
already installed.
My spark-env.sh on the local machine contains:
export HADOOP_CONF_DIR=/Users/nish1013/Dev/hadoop-2.7.1/etc/hadoop
Has anyone encountered something similar before?
I think the right command to call is the following:
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 --conf spark.yarn.jars=hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
or you can add
spark.yarn.jars hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
in your spark-defaults.conf file
