How to understand the spark-submit script when master is YARN? - apache-spark

We have 6 machines in total, with HDFS and YARN services on every node: 1 master and 6 slaves.
Spark is installed on 3 of those machines: 1 master and 3 workers (one node runs both the master and a worker).
We know that with --master spark://[host]:[port], the job runs only on those 3 nodes in standalone mode.
When we submit a jar with spark-submit --master yarn, will it use the CPU and memory of all 6 servers, or only the 3 Spark worker machines?
And if it can run on all 6 nodes, how do the other 3 servers know it's a Spark job?
Spark: 2.3.1
Hadoop: 2.7.3

In YARN mode, spark-submit sends a resource allocation request to YARN, and the containers are launched on different NodeManagers based on resource availability.
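For example, a submission like the sketch below (the class name, jar path, and executor sizes are placeholders, not taken from the question) asks YARN for containers, and YARN may place them on any of the 6 NodeManagers:
# Hypothetical example: class, jar path and sizes are illustrative only.
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.example.MyJob \
--num-executors 6 \
--executor-memory 2g \
--executor-cores 2 \
/path/to/my-job.jar
YARN then distributes the Spark application to whichever NodeManagers host the containers, which is how the 3 servers without a local Spark install can still run Spark tasks.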

Related

Why Spark Drivers don't run when Applications run?

I'm a beginner trying to learn about the behavior of applications and drivers by going through some examples. I'm starting off with:
Running a standalone cluster manager
Running a single master calling ./sbin/start-master.sh
Running a single worker calling ./sbin/start-slave.sh spark://localhost:7077
Launching a test application in client mode by calling:
./bin/spark-submit \
--master spark://localhost:7077 \
./examples/src/main/python/pi.py
According to the Docs:
The process running the main() function of the application and creating the SparkContext
My takeaway from this is there should be at least one driver program that runs when an application runs. However, I'm not seeing this in the web UI for the master:
Alive Workers: 1
Cores in use: 4 Total, 0 Used
Memory in use: 15.0 GB Total, 0.0 B Used
Applications: 0 Running, 1 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Shouldn't I expect to see 1 driver running or completed? I've included some config details below.
./conf/spark-defaults.conf:
spark.master=spark://localhost:7077
spark.eventLog.enabled=true
spark.eventLog.dir=./tmp/spark-events/
spark.history.fs.logDirectory=.tmp/spark-events/
spark.driver.memory=5g
If you are running an interactive shell, e.g. pyspark (CLI or via an IPython notebook), by default you are running in client mode.
In client mode the driver runs inside the process that submits the application rather than being launched through the master, so it is not counted under Drivers in the master UI; in cluster mode the driver is launched on a worker and does appear there.
NOTE: AFAIK you cannot run pyspark or any other interactive shell in cluster mode.
So try running the application in cluster mode using --deploy-mode cluster:
./bin/spark-submit \
--master spark://localhost:7077 \
--deploy-mode cluster \
./examples/src/main/python/pi.py

How to configure the YARN cluster with Spark?

I have 2 machines with 32 GB RAM and 8 cores each. How can I configure YARN with Spark, and which properties should I tune to size resources for our dataset? The dataset is 8 GB. Can anyone suggest a YARN-with-Spark configuration for running jobs in parallel?
I'm using Hadoop 2.7.3, Spark 2.2.0 and Ubuntu 16.
Here is the YARN configuration:
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 5120
yarn.nodemanager.resource.memory-mb = 30720
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 6
yarn.nodemanager.resource.cpu-vcores = 6
Here is the Spark configuration:
spark.master                        master:7077
spark.yarn.am.memory                4g
spark.yarn.am.cores                 4
spark.yarn.am.memoryOverhead        412m
spark.executor.instances            3
spark.executor.cores                4
spark.executor.memory               4g
spark.yarn.executor.memoryOverhead  412m
But my question is: with 32 GB RAM and 8 cores per machine, how many applications can I run, and is this configuration correct? Because right now only two applications run in parallel.
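As a sketch, the YARN values listed above are set in yarn-site.xml; one entry is shown below and the others follow the same pattern:
<!-- Sketch: how one of the listed YARN settings is written in yarn-site.xml. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>30720</value>
</property>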

Spark thrift server uses only 2 cores

Google Dataproc single-node cluster, VCores Total = 8. I've tried, as user spark:
/usr/lib/spark/sbin/start-thriftserver.sh --num-executors 2 --executor-cores 4
I tried changing /usr/lib/spark/conf/spark-defaults.conf, and I tried executing
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
before start-thriftserver.sh
No success. In the YARN UI I can see that the thrift app uses only 2 cores, with 6 cores available.
UPDATE1:
Environment tab in the Spark UI:
spark.submit.deployMode client
spark.master yarn
spark.dynamicAllocation.minExecutors 6
spark.dynamicAllocation.maxExecutors 10000
spark.executor.cores 4
spark.executor.instances 1
It depends on which YARN mode the app is running in.
In yarn-client mode, 1 core goes to the Application Master (and the driver runs on the machine where you ran start-thriftserver.sh).
In yarn-cluster mode, the driver lives inside the AM container, so you can tweak its cores with spark.driver.cores. The other cores are used by the executors (1 executor = 1 core by default).
Beware that --num-executors 2 --executor-cores 4 won't fit, since you have 8 cores in total and 1 more is needed for the AM container (2 × 4 + 1 = 9).
You can check core usage in the Spark UI: http://sparkhistoryserverip:18080/history/application_1534847473069_0001/executors/
Options below are only for Spark standalone mode:
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
Please review all configs here - Spark Configuration (latest)
In your case you can edit spark-defaults.conf and add:
spark.executor.cores 3
spark.executor.instances 2
Or use local[8] mode as you have only one node anyway.
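As a sketch, the same settings can also be passed on the command line, since start-thriftserver.sh accepts the usual spark-submit options (as the command above already does with --num-executors and --executor-cores):
# Sketch: pass the executor sizing directly instead of editing spark-defaults.conf.
/usr/lib/spark/sbin/start-thriftserver.sh \
--conf spark.executor.cores=3 \
--conf spark.executor.instances=2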
If you want YARN to show you the proper number of cores allocated to executors, change the value in capacity-scheduler.xml for:
yarn.scheduler.capacity.resource-calculator
from:
org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
to:
org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
Otherwise, no matter how many cores you request for your executors, YARN will show only one core per container.
Note that this config actually changes resource allocation behavior, not just the displayed numbers. More details: https://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/
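A sketch of the corresponding capacity-scheduler.xml entry (property and class names as given above):
<!-- Sketch: use the dominant-resource calculator so vcores are taken into account, not just memory. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>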

SPARK_WORKER_INSTANCES setting not working in Spark Standalone Windows

I'm trying to set up a standalone Spark 2.0 server to process an analytics function in parallel. To do this I want to run 8 workers, with a single core per worker. However, the Spark Master/Worker UI doesn't seem to reflect my configuration.
I'm using :
Standalone Spark 2.0
8 cores, 24 GB RAM
Windows Server 2008
pyspark
spark-env.sh file is configured as follows:
SPARK_WORKER_INSTANCES = 8
SPARK_WORKER_CORES = 1
SPARK_WORKER_MEMORY = 2g
spark-defaults.conf is configured as follows:
spark.cores.max = 8
I start the master:
spark-class org.apache.spark.deploy.master.Master
I start the workers by running this command 8 times within a batch file:
spark-class org.apache.spark.deploy.worker.Worker spark://10.0.0.10:7077
The problem is that the UI shows up as follows:
As you can see, each worker has 8 cores instead of the 1 core I assigned via the SPARK_WORKER_CORES setting. The memory also reflects the entire machine's memory, not the 2g assigned to each worker. How can I configure Spark to run with 1 core / 2g per worker in standalone mode?
I fixed this by adding the cores and memory arguments to the worker command itself:
start spark-class org.apache.spark.deploy.worker.Worker --cores 1 --memory 2g spark://10.0.0.10:7077
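As a sketch, the batch file can then launch all 8 workers with those arguments in a loop (Windows batch, assuming spark-class is on PATH as in the commands above):
rem Sketch: start 8 single-core, 2g workers against the standalone master.
for /L %%i in (1,1,8) do (
  start spark-class org.apache.spark.deploy.worker.Worker --cores 1 --memory 2g spark://10.0.0.10:7077
)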

how to : spark yarn cluster

I have set up a Hadoop cluster with 3 machines, one master and 2 slaves.
On the master I have installed Spark:
SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true sbt/sbt clean assembly
Added HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop to spark-env.sh
Then I ran:
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar
I checked localhost:8088 and saw the SparkPi application running.
Is that all, or should I install Spark on the 2 slave machines as well?
How can I get all the machines involved?
Is there any help doc out there? I feel like I'm missing something.
In Spark standalone mode we start the master and the worker:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
I also wanted to know how to get more than one worker running in this case.
I know we can configure slaves in conf/slaves, but can anyone share an example?
Please help, I am stuck.
Assuming you're using Spark 1.1.0, as it says in the documentation (http://spark.apache.org/docs/1.1.0/submitting-applications.html#master-urls), you can use the values yarn-cluster or yarn-client for the master parameter. You do not need the deploy-mode parameter in that case.
You do not have to install Spark on all the YARN nodes. That is what YARN is for: to distribute your application (in this case Spark) over a Hadoop cluster.
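For example, the submission from the question could then be written as a sketch like this (jar path, class, and sizes taken from the question above):
# Sketch using the yarn-cluster master URL from the linked 1.1.0 docs.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit \
--master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar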
