How does the Spark driver decide which executors are to be used? - apache-spark

How does a Spark driver program decide which executors are to be used for a particular job?
Is it driven by data locality?
Are the executors chosen based on the availability of data on a given data node?
If yes, what happens if all the data is present on a single data node, and that data node has just enough resources to run 2 executors, but the spark-submit command specifies --num-executors 4, i.e. asks for 4 executors?
Will the Spark driver copy some of the data from that data node to another data node and spawn the remaining 2 of the 4 requested executors there?
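For reference, the scheduler-side knob most directly tied to this data-locality question is spark.locality.wait: the driver's task scheduler first tries to place each task on a node that already holds that partition's HDFS block, and this setting controls how long it waits for such a slot before falling back to a less local one. A rough PySpark sketch (paths and values are placeholders, not from the question):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("locality-sketch")
        .set("spark.locality.wait", "3s"))   # 3s is the default; "0" disables the wait
sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs:///some/input/path")  # placeholder path
rdd.count()
# The Spark UI then shows each task's locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY).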

Related

How Apache Spark collects and coordinates the results from executors

Posting this question to learn how Apache Spark collects and coordinates the results from executors.
Suppose I'm running a job with 3 executors. My DataFrame is partitioned and distributed across these 3 executors.
Now, when I execute a count() or collect() action on the DataFrame, how will Spark coordinate the results from these 3 executors?
val prods = spark.read.format("csv").option("header", "true").load("testFile.csv")
prods.count() // How does Spark collect the data from the three executors? Who coordinates the results from the different executors and returns them to the driver?
When you do spark-submit you specify the master: the client program (driver) starts running against YARN if yarn is specified as the master, or locally if local is specified. See https://spark.apache.org/docs/latest/submitting-applications.html
Since you have added the yarn tag to the question, I am assuming you mean a YARN master URL. YARN launches the client program (driver) on one of the nodes of the cluster, then registers and assigns workers (executors) to the driver so that tasks can be executed on each node. Each transformation/action runs in parallel on the worker nodes (executors). Once each node completes its work, it returns its results to the driver program.
OK, which part is not clear?
Let me make it generic: the client/driver program launches and tells the master (local / standalone master / YARN, a.k.a. the cluster manager) that it wants resources to perform tasks, i.e. that workers should be allocated to the driver. The cluster manager in return allocates workers, launches executors on the worker nodes, and tells the client program which workers/executors it can use to do its job. The data is then divided across the worker nodes and the tasks/transformations run on them in parallel. Once collect() or count() is called (I assume this is the final part of the job), each executor returns its result back to the driver.
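To make the count()/collect() coordination concrete, here is a rough PySpark sketch (not from the original post) of the same two-step shape: each executor computes a partial count for the partitions it holds, only those numbers travel back, and the driver adds them up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-coordination-sketch").getOrCreate()
prods = spark.read.format("csv").option("header", "true").load("testFile.csv")

# count() is an action: each executor counts the rows of its own partitions and sends
# back only that partial count; the driver sums the partials into the final result.
partial_counts = prods.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()
total = sum(partial_counts)      # this final sum happens in the driver
print(total == prods.count())    # True: same result as the built-in action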

Need help in understanding PySpark execution with YARN as master

I already have some picture of the YARN architecture as well as the Spark architecture. But when I try to understand them together (that is, what happens when a Spark job runs on YARN as master) on a Hadoop cluster, I run into some confusion. So first I will describe my understanding with the example below, and then I will come to my confusions.
Say I have a file "orderitems" stored on HDFS with some replication factor.
Now I am processing the data by reading this file into a Spark RDD (say, for calculating order revenue).
I have written the code and configured the spark-submit as given below:
spark-submit \
--master yarn \
--conf spark.ui.port=21888 \
--num-executors 2 \
--executor-memory 512M \
src/main/python/order_revenue.py
Let's assume that I have created the RDD with 5 partitions and that I have executed it in yarn-client mode.
Now, as per my understanding, once I submit the Spark job on YARN:
The request goes to the Application Manager, which is a component of the Resource Manager.
The Application Manager will find one Node Manager and ask it to launch a container.
This is the first container of the application, and we call it the Application Master.
The Application Master takes over the responsibility of executing and monitoring the job.
Since I have submitted in client mode, the driver program will run on my edge node / gateway node.
I have provided num-executors as 2 and executor memory as 512 MB.
Also, I have provided the number of partitions for the RDD as 5, which means it will create 5 partitions of the data it reads and distribute them over 5 nodes.
Now here are my confusions over this:
I have read in the user guide that the partitions of an RDD will be distributed to different nodes. Are these nodes the same as the 'data nodes' of the HDFS cluster? I mean, here there are 5 partitions; does this mean they sit on 5 data nodes?
I have set num-executors to 2, so these 5 partitions of data will be processed by 2 executors (CPUs). So my next question is: from where will these 2 executors (CPUs) be picked? I mean, the 5 partitions are on 5 nodes, right, so are these 2 executors also on some of those nodes?
The scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues etc. Also, a container is a Linux control group (cgroup), a Linux kernel feature that allows users to allocate CPU, memory, disk I/O and bandwidth to a user process. So my final question is: are containers actually provided by the 'scheduler'?
I am confused here. I have referred to the architecture, the release documentation and some videos and got mixed up.
Hoping for some helping hands here.
To answer your questions first:
1) Very simply, an executor is Spark's worker and the driver is the manager, and they have nothing to do with Hadoop nodes. Think of the executors as processing units (say 2 here); repartition(5) divides the data into 5 chunks, and on some basis those chunks are divided amongst the 2 executors. Repartitioning the data does not create nodes.
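A rough PySpark sketch of that point (the file name comes from the question; everything else is illustrative): the 5 partitions are just 5 chunks of data, and however many executors exist simply pull tasks for those chunks.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("repartition-sketch")  # assume it was submitted with --num-executors 2
sc = SparkContext(conf=conf)

rdd = sc.textFile("orderitems").repartition(5)   # 5 partitions, independent of HDFS data nodes
print(rdd.getNumPartitions())                    # 5
# glom() groups the rows per partition; the scheduler hands these 5 tasks to the
# 2 executors, so each executor ends up processing more than one partition.
print(rdd.glom().map(len).collect())             # sizes of the 5 partitions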
(Diagrams: Spark cluster architecture; Spark on YARN, client mode; Spark on YARN, cluster mode.)
For other details you can read the blog posts https://sujithjay.com/2018/07/24/Understanding-Apache-Spark-on-YARN/ and https://0x0fff.com/spark-architecture/

Why is Spark utilizing only one core per executor? How does it decide how many cores to utilize, beyond the number of partitions?

I am running Spark in an HPC environment on Slurm, using Spark standalone mode, Spark version 1.6.1. The problem is that my Slurm node is not fully used in Spark standalone mode. I am using spark-submit in my Slurm script. There are 16 cores available on a node, and I get all 16 cores per executor as I see in the Spark UI. But only one core per executor is actually utilized. The top + 1 command on the worker node where the executor process is running shows that only one CPU out of 16 is being used. I have 255 partitions, so the number of partitions does not seem to be the problem here.
$SPARK_HOME/bin/spark-submit \
--class se.uu.farmbio.vs.examples.DockerWithML \
--master spark://$MASTER:7077 \
--executor-memory 120G \
--driver-memory 10G \
When I change the script to
$SPARK_HOME/bin/spark-submit \
--class se.uu.farmbio.vs.examples.DockerWithML \
--master local[*] \
--executor-memory 120G \
--driver-memory 10G \
I see 0 cores allocated to the executor in the Spark UI, which is understandable because we are no longer using Spark standalone cluster mode. But now all the cores are utilized when I check with the top + 1 command on the worker node, which hints that the problem is not with the application code but with how Spark standalone mode utilizes the resources.
So how does Spark decide to use one core per executor when it has 16 cores and also has enough partitions? What can I change so that it utilizes all the cores?
I am using spark-on-slurm for launching the jobs.
The Spark configurations in both cases are as follows:
--master spark://MASTER:7077
(spark.app.name,DockerWithML)
(spark.jars,file:/proj/b2015245/bin/spark-vs/vs.examples/target/vs.examples-0.0.1-jar-with-dependencies.jar)
(spark.app.id,app-20170427153813-0000)
(spark.executor.memory,120G)
(spark.executor.id,driver)
(spark.driver.memory,10G)
(spark.history.fs.logDirectory,/proj/b2015245/nobackup/eventLogging/)
(spark.externalBlockStore.folderName,spark-75831ca4-1a8b-4364-839e-b035dcf1428d)
(spark.driver.maxResultSize,2g)
(spark.executorEnv.OE_LICENSE,/scratch/10230979/SureChEMBL/oe_license.txt)
(spark.driver.port,34379)
(spark.submit.deployMode,client)
(spark.driver.host,x.x.x.124)
(spark.master,spark://m124.uppmax.uu.se:7077)
--master local[*]
(spark.app.name,DockerWithML)
(spark.app.id,local-1493296508581)
(spark.externalBlockStore.folderName,spark-4098cf14-abad-4453-89cd-3ce3603872f8)
(spark.jars,file:/proj/b2015245/bin/spark-vs/vs.examples/target/vs.examples-0.0.1-jar-with-dependencies.jar)
(spark.driver.maxResultSize,2g)
(spark.master,local[*])
(spark.executor.id,driver)
(spark.submit.deployMode,client)
(spark.driver.memory,10G)
(spark.driver.host,x.x.x.124)
(spark.history.fs.logDirectory,/proj/b2015245/nobackup/eventLogging/)
(spark.executorEnv.OE_LICENSE,/scratch/10230648/SureChEMBL/oe_license.txt)
(spark.driver.port,36008)
Thanks,
The problem is that you only have one worker node. In Spark standalone mode, one executor is launched per worker instance. To launch multiple logical worker instances, and thereby multiple executors within one physical worker, you need to configure this property:
SPARK_WORKER_INSTANCES
By default, it is set to 1. You can increase it according to the computation you are doing in your code, so as to utilize the amount of resources you have.
You want your job to be distributed among executors to utilize the resources properly, but what is happening is that only one executor is launched, and it cannot utilize the number of cores and the amount of memory you have. So you are not getting the flavor of Spark's distributed computation.
You can set SPARK_WORKER_INSTANCES = 5
And allocate 2 cores per executor; so, 10 cores would be utilized properly.
Like this, you tune the configuration to get optimum performance.
Try setting spark.executor.cores (its default value is 1 in YARN mode; in standalone mode the default is all available cores on the worker).
According to the Spark documentation:
The number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
See https://spark.apache.org/docs/latest/configuration.html
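A rough sketch of how that looks in code for standalone mode (the master URL and app name come from the configuration dump above; the core and memory values are illustrative):
from pyspark import SparkConf, SparkContext

# Ask for 2 cores per executor and cap the application at 10 cores, so the standalone
# master can launch up to 5 executors for this app (given enough free cores and memory).
conf = (SparkConf()
        .setMaster("spark://MASTER:7077")
        .setAppName("DockerWithML")
        .set("spark.executor.cores", "2")
        .set("spark.cores.max", "10")
        .set("spark.executor.memory", "20g"))
sc = SparkContext(conf=conf)
# SPARK_WORKER_INSTANCES (set in conf/spark-env.sh on the worker machine) is the other
# route: it starts several worker processes on one box, each hosting its own executor.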
In Spark cluster mode you should use --num-executors equal to total_cores_per_node * num_of_nodes. For example, if you have 3 nodes with 8 cores per node, you should write --num-executors 24.

Spark number of cores used

I have a very simple Spark job which reads a million movie ratings and reports the ratings and the number of times each rating occurs.
The job runs on the Spark cluster and it is running fine.
I have a couple of questions about the parameters that I use to run the job.
I have 2 nodes running:
Node-1 = 24 GB RAM & 8 VCPUs.
Node-2 = 8 GB RAM & 2 VCPUs.
So in total I have 32 GB RAM and 10 VCPUs.
spark-submit command.
spark-submit --master spark://hadoop-master:7077 --executor-memory 4g --num-executors 4 --executor-cores 4 /home/hduser/ratings-counter.py
1. When I run the above command, which cores does Spark use, from node-1 or node-2, or does it allocate them randomly?
2. If I don't set the number of executors, what is the default number of executors Spark uses?
from pyspark import SparkConf, SparkContext
import collections
conf = SparkConf().setMaster("hadoop-master").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
lines = sc.textFile("hdfs://hadoop-master:8020/user/hduser/gutenberg/ml-10M100K/ratings.dat")
ratings = lines.map(lambda x: x.split('::')[2])
result = ratings.countByValue()
sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))
Is it from node-1 or node-2, or does it allocate randomly?
It really depends on how many workers you have initialized. Since in your spark-submit command you have specified a total of 4 executors, each executor will allocate 4 GB of memory and 4 cores from a Spark worker's total memory and cores. One easy way to see on which node each executor was started is to check Spark's Master UI (the default port is 8080) and select your running app from there. Then you can check the Executors tab within the application's UI.
If I don't set the number of executors, what is the default number of executors Spark uses?
Usually, it initializes one executor per worker instance and uses all of the worker's resources.
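A rough back-of-the-envelope check for the cluster in the question (plain Python arithmetic, not a Spark API), assuming the standalone master only places an executor on a worker that has enough free cores and memory:
nodes = {"node-1": {"cores": 8, "mem_gb": 24}, "node-2": {"cores": 2, "mem_gb": 8}}
executor_cores, executor_mem_gb = 4, 4   # from --executor-cores 4 --executor-memory 4g

for name, res in nodes.items():
    fits = min(res["cores"] // executor_cores, res["mem_gb"] // executor_mem_gb)
    print(name, "can host", fits, "such executor(s)")
# node-1 -> 2, node-2 -> 0: a 4-core executor does not fit on a 2-VCPU worker, so a
# request for 4 such executors cannot be fully satisfied on this cluster.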

Spark Streaming: number of executors vs. Custom Receiver

Why is Spark with one worker node and four executors, each with one core, not able to process a Custom Receiver?
What is the reason for not processing incoming data via a Custom Receiver if the executor has a single core in Spark Streaming?
I am running Spark in standalone mode. I am receiving data in custom receivers in a Spark Streaming app. My laptop has 4 cores.
master="spark://lappi:7077"
$spark_path/bin/spark-submit --executor-cores 1 --total-executor-cores 4 \
--class "my.class.path.App" \
--master $master
You indicate that your (1) executor should have 1 core reserved for Spark, which means you use 1 of your 4 cores. The parameter total-executor-cores is never the limiting factor here, since it limits the total number of cores on your cluster reserved for Spark, which is, per your previous setting, 1.
The receiver consumes one thread for consuming data out of the one you have available, which means you have no core left to process the data. All of this is explained in the docs:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
You want to bump that executor-cores parameter to 4.
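A rough PySpark sketch of that core budget, using the built-in socket receiver as a stand-in for the custom receiver (the host, port and batch interval are placeholders):
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("receiver-cores-sketch")
        .setMaster("spark://lappi:7077")
        .set("spark.executor.cores", "4")    # give the single executor all 4 cores
        .set("spark.cores.max", "4"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)                # 5-second batches (placeholder)

lines = ssc.socketTextStream("localhost", 9999)  # the receiver pins one core/thread
lines.count().pprint()                           # the remaining cores process the batches

ssc.start()
ssc.awaitTermination()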
