Spark on YARN: Job Submitted vs. Accepted?

I am running a Spark job in YARN cluster mode. What is the difference between the YARN Accepted and YARN Submitted statuses?

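We submit the Spark job using spark-submit (YARN cluster mode).
YARN Submitted: the application has been submitted to the ResourceManager and is waiting for the scheduler (FIFO/Fair scheduler) to accept it into a queue.
YARN Accepted: the scheduler has accepted the application, but it is not running yet. It stays in this state until a container is allocated for the ApplicationMaster and the ApplicationMaster registers with the ResourceManager, after which the status changes to Running. An application that stays in Accepted for a long time usually means that the queue or cluster has no free resources for the ApplicationMaster container.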
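The state names can be mapped to their place in YARN's application lifecycle. Below is a minimal sketch that explains a reported state; it assumes the state arrives as a small JSON document of the kind the ResourceManager's REST API returns (the same state is also shown by yarn application -status). The function name describe_state and the sample body are hypothetical and only for illustration.

```python
import json

# Short notes on the YARN application states relevant to the question.
STATE_NOTES = {
    "SUBMITTED": "handed to the ResourceManager; waiting for the scheduler to accept it",
    "ACCEPTED": "accepted by the scheduler; waiting for the ApplicationMaster to start and register",
    "RUNNING": "the ApplicationMaster has registered; the application is executing",
}


def describe_state(response_body):
    """Parse a body such as '{"state": "ACCEPTED"}' and explain the state."""
    state = json.loads(response_body)["state"]
    note = STATE_NOTES.get(state, "no note recorded for this state")
    return f"{state}: {note}"


# Hypothetical response body, used here instead of a live cluster call.
print(describe_state('{"state": "ACCEPTED"}'))
```

Keeping the lookup separate from any network call makes the sketch usable with whatever way the state is actually obtained.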

Related

spark.shuffle.service.enabled=true cluster.YarnScheduler: Initial job has not accepted any resources

I am trying to run a pyspark job on YARN with the spark.shuffle.service.enabled=true option, but the job never completes.
Without the option, the job works well:
user@e7524bf7f996:~$ pyspark --master yarn
Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://e7524bf7f996:4040
Spark context available as 'sc' (master = yarn, app id = application_1644937120225_0004).
SparkSession available as 'spark'.
>>> sc.parallelize(range(10)).sum()
45
With the option --conf spark.shuffle.service.enabled=true:
user@e7524bf7f996:~$ pyspark --master yarn --conf spark.shuffle.service.enabled=true
Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://e7524bf7f996:4040
Spark context available as 'sc' (master = yarn, app id = application_1644937120225_0005).
SparkSession available as 'spark'.
>>> sc.parallelize(range(10)).sum()
2022-02-15 15:10:14,591 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2022-02-15 15:10:29,590 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2022-02-15 15:10:44,591 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Are there other options in Spark or YARN that should be enabled to make spark.shuffle.service.enabled work?
I am running Spark 3.1.2, Python 3.9.7, and Hadoop 3.2.1.
Thank you,
Bertrand
You need to configure the external shuffle service on the YARN cluster by following these steps:
1. Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
2. Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/common/network-yarn/target/scala- if you are building Spark yourself, and under yarn if you are using a distribution.
3. Add this jar to the classpath of all NodeManagers in your cluster.
4. In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService.
5. Increase the NodeManager's heap size by setting YARN_HEAPSIZE (1000 by default) in etc/hadoop/yarn-env.sh to avoid garbage collection issues during shuffle.
6. Restart all NodeManagers in your cluster.
For details, refer to https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
If it is still not working, check the following:
1. Check the YARN UI to ensure that enough resources are available.
2. Try --deploy-mode cluster to ensure that the driver can communicate with the YARN cluster for scheduling.
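The yarn-site.xml step is easy to get subtly wrong, so it can help to check the file before restarting the NodeManagers. The following is a minimal sketch that parses a Hadoop-style configuration and reports whether the two shuffle-service properties named above are set. The helper names and the SAMPLE document (including the mapreduce_shuffle entry) are assumptions made for illustration, not part of the answer.

```python
import xml.etree.ElementTree as ET

SHUFFLE_CLASS = "org.apache.spark.network.yarn.YarnShuffleService"
CLASS_KEY = "yarn.nodemanager.aux-services.spark_shuffle.class"


def read_properties(xml_text):
    """Return the <property> name/value pairs of a Hadoop-style config as a dict."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def shuffle_service_problems(props):
    """List what is missing for the Spark external shuffle service."""
    problems = []
    raw = props.get("yarn.nodemanager.aux-services", "")
    services = [s.strip() for s in raw.split(",") if s.strip()]
    if "spark_shuffle" not in services:
        problems.append("spark_shuffle is missing from yarn.nodemanager.aux-services")
    if props.get(CLASS_KEY) != SHUFFLE_CLASS:
        problems.append(f"{CLASS_KEY} is not set to {SHUFFLE_CLASS}")
    return problems


# Hypothetical yarn-site.xml content used only to exercise the check.
SAMPLE = """
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
</configuration>
"""

print(shuffle_service_problems(read_properties(SAMPLE)))
```

To check a real file, the same functions can be called with the text read from the yarn-site.xml on each node. The sketch does not verify that the shuffle jar is on the NodeManager classpath.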
Thanks Warren for your help.
Here is the setup working for me:
https://github.com/BertrandBrelier/SparkYarn/blob/main/yarn-site.xml
echo "export YARN_HEAPSIZE=2000" >> /home/user/hadoop-3.2.1/etc/hadoop/yarn-env.sh
ln -s /home/user/spark-3.1.2-bin-hadoop3.2/yarn/spark-3.1.2-yarn-shuffle.jar /home/user/hadoop-3.2.1/share/hadoop/yarn/lib/.
echo "spark.shuffle.service.enabled true" >> /home/user/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf
Then I restarted Hadoop and Spark.
After that, I was able to start a pyspark session:
pyspark --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true

Can the apache_beam Spark runner with Python be used on a remote Spark cluster?

I am following the Python guide for the Beam Spark runner. The Beam pipeline can submit a job to a local job server, launched by ./gradlew :runners:spark:job-server:runShadow, with a local Spark,
and, with the additional parameter -PsparkMasterUrl=spark://localhost:7077, to a pre-deployed Spark.
But I have a Spark cluster on YARN. I set the launch command to ./gradlew :runners:spark:job-server:runShadow -PsparkMasterUrl=yarn (I also tried yarn-client), but I only get org.apache.spark.SparkException: Could not parse Master URL: 'yarn'
The source code of the Spark runner (beam\sdks\python\apache_beam\runners\portability\spark_runner.py) shows:
parser.add_argument('--spark_master_url',
default='local[4]',
help='Spark master URL (spark://HOST:PORT). '
'Use "local" (single-threaded) or "local[*]" '
'(multi-threaded) to start a local cluster for '
'the execution.')
It doesn't mention 'yarn', and the provided SparkContext and StreamingListeners are not supported on the Spark portable runner. So does that mean the apache_beam Spark runner with Python can't be used on a remote Spark cluster (mostly YARN) and can only be tested locally? Or maybe I can set job_endpoint to the URL of a remote job server on my Spark cluster?
Also, every ./gradlew command blocks at 98%, although the job server starts with output like this:
19/11/28 13:47:48 INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver: JobService started on localhost:8099
<============-> 98% EXECUTING [16s]
> IDLE
> :runners:spark:job-server:runShadow
> IDLE
"So does that mean the apache_beam Spark runner with Python can't be used on a remote Spark cluster (mostly YARN)?"
We've recently added portable Spark jars, which can be submitted via spark-submit. However, this feature isn't scheduled to be included in a Beam release until 2.19.0.
I created a JIRA ticket to track the status of YARN support, in case there are other related issues that need to be addressed.
"Every ./gradlew command blocks at 98%."
That's expected behavior. The job server will stay running until canceled.

How does spark-submit behave when the master is YARN?

We have 6 machines in total, with the HDFS and YARN services on all nodes (1 master and 6 slaves).
We installed Spark on 3 of the machines: 1 master and 3 workers (one node is both master and worker).
We know that with --master spark://[host]:[port], the job runs on only those 3 nodes in standalone mode.
When we use spark-submit --master yarn to submit a jar, will it use the CPU and memory of all 6 servers, or just the 3 Spark worker machines?
And if it can run on all 6 nodes, how do the remaining 3 servers know about the Spark job?
Spark: 2.3.1
Hadoop: 2.7.3
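In YARN mode, spark-submit sends a resource allocation request to YARN, and the containers are launched on different NodeManagers based on resource availability. The job can therefore run on any of the 6 nodes, and Spark does not need to be installed on them, because the Spark runtime is distributed to the containers along with the application.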

Spark in Yarn Web UI not getting displayed

I am unable to view the Spark history through the YARN UI (YARN web address 8088 in yarn-site.xml). The Spark job completed successfully.
The Spark application was run from the datanode shell with the deploy mode set to cluster.
When I click on History, it redirects to http://namenode:18088/history/application_1472647811761_0001/1 and says the page cannot be displayed.
Hadoop Version: 2.7.0
Spark Version: 2.0.0
Cluster: one namenode and one datanode
spark-default.xml
spark.eventLog.dir=hdfs://namenode:9000/user/spark/applicationHistory
spark.eventLog.enabled=true
spark.yarn.historyServer.address=namenode:18088
spark.history.fs.logDirectory=hdfs://namenode:9000/shared/spark-logs
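One thing worth checking in the settings above: applications write their event logs to spark.eventLog.dir, while the history server reads them from spark.history.fs.logDirectory, and the two values listed here differ. Whether that is the cause of the blank page in this case is an assumption; the sketch below only shows how the comparison could be automated. The function names are hypothetical.

```python
import re


def parse_spark_conf(text):
    """Parse 'key=value' or 'key value' lines into a dict, skipping comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' (with optional spaces) or the first whitespace run.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1]
    return conf


def log_dir_mismatch(conf):
    """True when both directories are set and point to different locations."""
    written = conf.get("spark.eventLog.dir")
    read = conf.get("spark.history.fs.logDirectory")
    if written is None or read is None:
        return False
    return written.rstrip("/") != read.rstrip("/")


# The settings exactly as listed in the question.
QUESTION_CONF = """
spark.eventLog.dir=hdfs://namenode:9000/user/spark/applicationHistory
spark.eventLog.enabled=true
spark.yarn.historyServer.address=namenode:18088
spark.history.fs.logDirectory=hdfs://namenode:9000/shared/spark-logs
"""

print(log_dir_mismatch(parse_spark_conf(QUESTION_CONF)))
```

The check says nothing about whether a history server is actually running and reachable at the configured address, which would need to be verified separately.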

How to submit pyspark job in yarn cluster mode from code

Can we submit a pyspark job in YARN cluster mode from Python code?
spark-submit is the command for submitting a pyspark job to Spark, and we have to specify YARN cluster mode to deploy the job on the cluster:
spark-submit --master yarn --deploy-mode cluster py_files.py
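One straightforward way to do this from Python is to invoke the same spark-submit command as a subprocess. This is a minimal sketch, assuming spark-submit is on the PATH of the machine running the code; the helper names build_submit_command and submit are hypothetical, and the script name is the one from the command above.

```python
import subprocess


def build_submit_command(script, master="yarn", deploy_mode="cluster", conf=None):
    """Assemble the spark-submit argument list for a pyspark script."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


def submit(script, **kwargs):
    """Run spark-submit and return its exit code (requires spark-submit on PATH)."""
    result = subprocess.run(build_submit_command(script, **kwargs), check=False)
    return result.returncode


# Only builds and prints the command; call submit("py_files.py") to run it.
print(build_submit_command("py_files.py"))
```

Passing the arguments as a list avoids shell quoting issues, and the exit code returned by submit can be used to decide whether the submission succeeded.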

Resources