Spark and YARN - how to make them work together - apache-spark

I have a conceptual question about YARN and Spark. I have 2 YARN (AM) nodes with 28GB and 4 CPUs each, and a worker node with 56GB and 8 CPUs.
I always submit my applications in yarn-cluster mode via the spark-submit options.
How do I use all of the worker node's memory and CPUs if the YARN master has fewer resources?
Can the worker node's settings exceed the YARN master's settings?
If my spark.executor.memory parameter is larger than the YARN memory, will it be honored or not?
Will the full potential of my worker node be used?
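For illustration only (all values below are hypothetical, not taken from the question), a yarn-cluster submission sized against the 56GB / 8-CPU worker node might look like the sketch below. Whether each container is granted depends on yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb on the worker node, not on the YARN master's own memory; a request for spark.executor.memory plus its overhead that exceeds yarn.scheduler.maximum-allocation-mb is rejected by Spark's YARN client rather than silently trimmed. (spark.executor.memoryOverhead is the Spark 2.3+ name; older versions use spark.yarn.executor.memoryOverhead.)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=3 \
  --conf spark.executor.memory=20g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.driver.memory=4g \
  your-app.jar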

Related

How to understand spark-submit script master is YARN?

We have 6 machines in total, with the HDFS and YARN services on every node: 1 master and 6 slaves.
Spark is installed on 3 machines: 1 master and 3 workers (one node is both master and worker).
We know that with --master spark://[host]:[port] the job runs on only the 3 nodes, in standalone mode.
When we submit a jar with spark-submit --master yarn, will it use the CPU and memory of all 6 servers, or just the 3 Spark worker machines?
And if it can run on all 6 nodes, how do the remaining 3 servers know it is a Spark job?
Spark: 2.3.1
Hadoop: 2.7.3
In YARN mode, spark-submit sends a resource allocation request to YARN, and the containers are launched on whichever NodeManagers have available resources, so the job can use all 6 nodes; the Spark runtime jars are shipped to those containers as part of the YARN application, which is how the other nodes can run Spark tasks without a local Spark installation.
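As a sketch (the class and jar names are placeholders), the two submission modes in question look like this; the first can only use the 3 standalone Spark workers, while the second lets YARN place containers on any of the 6 NodeManagers with free resources:
# Standalone mode: only the 3 nodes running Spark workers are used
spark-submit --class com.example.App --master spark://master-host:7077 app.jar
# YARN mode: containers may land on any NodeManager in the cluster
spark-submit --class com.example.App --master yarn --deploy-mode cluster app.jar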

How to configure a YARN cluster with Spark?

I have 2 machines with 32GB RAM and 8 cores each. How can I configure YARN with Spark, and which properties should I use to tune resources for our dataset? The dataset is 8GB, so can anyone suggest a YARN-with-Spark configuration for running jobs in parallel?
Here is the yarn configuration:
I'm using Hadoop 2.7.3, Spark 2.2.0 and Ubuntu 16.
yarn.scheduler.minimum-allocation-mb     2048
yarn.scheduler.maximum-allocation-mb     5120
yarn.nodemanager.resource.memory-mb      30720
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 6
yarn.nodemanager.resource.cpu-vcores     6
Here is the spark configuration:
spark.master                        master:7077
spark.yarn.am.memory                4g
spark.yarn.am.cores                 4
spark.yarn.am.memoryOverhead        412m
spark.executor.instances            3
spark.executor.cores                4
spark.executor.memory               4g
spark.yarn.executor.memoryOverhead  412m
But my question is: with 32GB RAM and 8 cores on each machine, how many applications can I run, and is this configuration correct? Because only two applications run in parallel.
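As a rough back-of-envelope under the configuration above (an estimate only; it ignores YARN rounding container sizes up toward the minimum/maximum allocation settings and assumes the default memory-only DefaultResourceCalculator), each application asks for one AM container plus three executor containers:
# Per application (requested sizes, before YARN rounding):
#   AM container:          4g + 412m overhead        ≈ 4.5 GB
#   3 executor containers: 3 × (4g + 412m overhead)  ≈ 13.5 GB
#   total per application                            ≈ 18 GB
# Memory exposed to YARN:  2 nodes × 30720 MB        ≈ 60 GB
# After rounding each container up, roughly two to three such applications
# can hold containers at once, which is consistent with seeing only two
# running in parallel. The Capacity Scheduler's
# yarn.scheduler.capacity.maximum-am-resource-percent (default 0.1) also
# caps how many concurrent Application Masters can run and is worth checking.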

Spark Thrift Server uses only 2 cores

Google Dataproc single-node cluster, VCores Total = 8. As the spark user I've tried:
/usr/lib/spark/sbin/start-thriftserver.sh --num-executors 2 --executor-cores 4
I also tried changing /usr/lib/spark/conf/spark-defaults.conf, and tried executing
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
before start-thriftserver.sh
No success. In the YARN UI I can see that the Thrift app uses only 2 cores, with 6 cores still available.
UPDATE 1:
Environment tab in the Spark UI:
spark.submit.deployMode client
spark.master yarn
spark.dynamicAllocation.minExecutors 6
spark.dynamicAllocation.maxExecutors 10000
spark.executor.cores 4
spark.executor.instances 1
It depends on which YARN mode the app is running in.
In yarn-client mode, 1 core goes to the Application Master (the app itself runs on the machine where you ran start-thriftserver.sh).
In yarn-cluster mode, the driver runs inside the AM container, so you can tweak its cores with spark.driver.cores. The remaining cores are used by executors (1 core per executor by default).
Beware that --num-executors 2 --executor-cores 4 won't work, as you have at most 8 cores and one more will be needed for the AM container (9 in total).
You can check core usage in the Spark UI - http://sparkhistoryserverip:18080/history/application_1534847473069_0001/executors/
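The same numbers are also available from the history server's REST API if you prefer the command line (host, port and application ID are the same ones as in the URL above):
curl http://sparkhistoryserverip:18080/api/v1/applications/application_1534847473069_0001/executors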
Options below are only for Spark standalone mode:
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
Please review all configs here - Spark Configuration (latest)
In your case you can edit spark-defaults.conf and add:
spark.executor.cores 3
spark.executor.instances 2
Or use local[8] mode, since you have only one node anyway.
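For example (a sketch; start-thriftserver.sh forwards extra options to spark-submit, so the same settings can also be passed directly instead of editing spark-defaults.conf):
/usr/lib/spark/sbin/stop-thriftserver.sh
/usr/lib/spark/sbin/start-thriftserver.sh \
  --conf spark.executor.cores=3 \
  --conf spark.executor.instances=2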
If you want YARN to show the proper number of cores allocated to executors, change the value in capacity-scheduler.xml for:
yarn.scheduler.capacity.resource-calculator
from:
org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
to:
org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
Otherwise it doesn't matter how many cores you ask for your executors; YARN will show you only one core per container.
Note that this config actually changes the resource allocation behavior, not just the display. More details: https://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/
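After editing capacity-scheduler.xml the ResourceManager has to pick up the change; on Dataproc something like the following usually works (the service name may differ on other distributions):
# restart the YARN ResourceManager so the new resource calculator takes effect
sudo systemctl restart hadoop-yarn-resourcemanager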

Spark client mode - YARN allocates a container for driver?

I am running Spark on YARN in client mode, so I expect YARN to allocate containers only for the executors. Yet, from what I am seeing, a container also seems to be allocated for the driver, and I don't get as many executors as I was expecting.
I am running spark-submit on the master node. The parameters are as follows:
sudo spark-submit --class ... \
--conf spark.master=yarn \
--conf spark.submit.deployMode=client \
--conf spark.yarn.am.cores=2 \
--conf spark.yarn.am.memory=8G \
--conf spark.executor.instances=5 \
--conf spark.executor.cores=3 \
--conf spark.executor.memory=10G \
--conf spark.dynamicAllocation.enabled=false \
While running this application, Spark UI's Executors page shows 1 driver and 4 executors (5 entries in total). I would expect 5, not 4 executors.
At the same time, YARN UI's Nodes tab shows that on the node that isn't actually used (at least according to Spark UI's Executors page...) there's a container allocated, using 9GB of memory. The rest of the nodes have containers running on them, 11GB of memory each.
Because in my Spark Submit the driver has 2GB less memory than executors, I think that the 9GB container allocated by YARN is for the driver.
Why is this extra container allocated? How can I prevent this?
[Spark UI and YARN UI screenshots]
Update after answer by Igor Dvorzhak
I had wrongly assumed that the AM would run on the master node and that it would contain the driver (so that the spark.yarn.am.* settings would apply to the driver process).
So I made the following changes:
set the spark.yarn.am.* settings back to their defaults (512m of memory, 1 core)
set the driver memory through spark.driver.memory to 8g
did not try to set driver cores at all, since that is only valid in cluster mode
Because the AM with default settings takes 512m plus 384m of overhead, its container fits into the spare 1GB of free memory on a worker node.
Spark gets the 5 executors it requested, and the driver memory matches the 8g setting. Everything works as expected now.
[Spark UI and YARN UI screenshots]
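Putting the update together, the corrected client-mode submission looks roughly like this (the class is elided as in the question and the jar name is a placeholder; spark.yarn.am.* is left at its defaults and the driver is sized via spark.driver.memory):
sudo spark-submit --class ... \
  --conf spark.master=yarn \
  --conf spark.submit.deployMode=client \
  --conf spark.driver.memory=8G \
  --conf spark.executor.instances=5 \
  --conf spark.executor.cores=3 \
  --conf spark.executor.memory=10G \
  --conf spark.dynamicAllocation.enabled=false \
  your-app.jar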
An extra container is allocated for the YARN application master:
In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
So even though in client mode the driver runs in the client process, the YARN application master still runs on YARN and requires a container allocation.
There is no way to prevent the container allocation for the YARN application master.
For reference, a similar question asked some time ago: Resource Allocation with Spark and Yarn.
You can specify the driver memory and number of executors in spark-submit as below.
spark-submit --jars..... --master yarn --deploy-mode cluster --driver-memory 2g --driver-cores 4 --num-executors 5 --executor-memory 10G --executor-cores 3
Hope it helps you.

Launched executors are fewer than the number of executors specified

I have an EMR cluster with the following configuration:
Node           Cores  RAM (GB)  yarn.nodemanager.resource.memory-mb (MB)
Master         4      15        11532
Core (slave1)  16     30        23040
Core (slave2)  16     30        23040
Core (slave3)  16     30        23040
Core (slave4)  16     30        23040
I am starting a Spark application with one job that gets divided into 2 stages, using --master yarn-client with the following configurations:
--num-executors 12 --executor-cores 5 --executor-memory 7G ---->(1)
--num-executors 12 --executor-cores 5 --executor-memory 6G ---->(2)
I have not modified any other parameter so spark.storage.* and spark.shuffle.* fractions are default.
The calculations I performed to arrive at the above configuration (the master node does no computation other than serving as the driver, as verified using Ganglia) are:
1. Allocate 15 cores to YARN per node and start 3 executors per node,
which gives 4 (slave nodes) × 3 = 12 executors.
2. 15 cores / 3 executors = 5 cores per executor.
3. 23040 × (1 − 0.07) ≈ 21G. Dividing this among three executors gives
21 / 3 = 7G.
With configuration (1) it does not launch 12 executors, whereas with (2) it does. Even though there seems to be enough memory per executor, why can it not launch 12 executors in case (1)?
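One likely explanation, as a back-of-envelope estimate (it assumes the default executor memory overhead of max(384 MB, 10% of spark.executor.memory) and no other containers on the core nodes): each executor container must fit, overhead included, under yarn.nodemanager.resource.memory-mb = 23040 MB per node.
# Configuration (1): 7G executors
#   7168 MB + max(384, 0.10 × 7168) ≈ 7168 + 717 = 7885 MB per container
#   3 containers per node ≈ 23655 MB > 23040 MB  → only 2 executors fit per node
# Configuration (2): 6G executors
#   6144 MB + max(384, 0.10 × 6144) ≈ 6144 + 615 = 6759 MB per container
#   3 containers per node ≈ 20277 MB < 23040 MB  → 3 executors per node, 12 in total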
What is your memory utilization like? Have you checked yarn-site.xml on the NodeManager hosts to see whether all of that memory and CPU is actually being exposed via the NodeManager configuration?
You can run yarn node -list for a list of nodes and then yarn node -status (I believe) to see what each node exposes to YARN as far as resources go.
Consult yarn logs -applicationId to see a detailed log of your application's interaction, including captures of output.
Finally, look at the YARN logs on the ResourceManager host to see if there are any issues there.
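Roughly, those checks look like this on the command line (the node and application IDs are placeholders):
yarn node -list -all                       # list NodeManagers, their state and running containers
yarn node -status <nodeId>                 # memory and vcores this node advertises to YARN
yarn logs -applicationId <applicationId>   # aggregated logs for the application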
