What is pyspark driver? [closed] - apache-spark

I saw that a common setup to start pyspark is using pyspark --master yarn --deploy-mode client --num-executors 4 --executor-memory 2g --driver-memory 4g, but how does driver memory differ from executor memory? Could you please explain what a driver is and how setting its memory here affects the pyspark workflow/performance?
Thanks!

Spark uses a master/slave architecture: one central coordinator (the driver) communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.
DRIVER
The driver is the process where the main method runs. It first converts the user program into tasks, and then it schedules those tasks on the executors.
EXECUTORS
Executors are worker-node processes in charge of running individual tasks in a given Spark job. They are launched at the beginning of a Spark application and typically run for its entire lifetime. Once they have run a task, they send the results to the driver. They also provide in-memory storage for RDDs that are cached by user programs, through the Block Manager.
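As a minimal hedged sketch of that caching behavior (assuming a live SparkContext named sc; the data is arbitrary):

val doubled = sc.parallelize(1 to 1000000).map(_ * 2)
doubled.cache() // mark the RDD for in-memory storage on the executors (Block Manager)
doubled.count() // first action materializes the partitions and caches them
doubled.sum()   // second action reuses the cached partitions instead of recomputing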

A Spark driver is the process that runs the main() function of the application and creates the SparkContext.
--driver-memory sets the memory used by this driver. If you run your application in client mode, this will most probably be the maximum memory used by the master node. The master node is only used to coordinate jobs between the executors, so it's not really used to do any computation.
The driver's memory can be filled by calling an action, like collect, that returns a list containing all of the elements of the RDD. If the RDD is bigger than the driver memory, the Spark application will throw an OutOfMemory error.
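As a hedged illustration (the RDD here is a stand-in and the sizes are arbitrary), prefer bounded actions when the data may exceed driver memory:

val big = sc.parallelize(1 to 100000000) // large dataset living on the executors
// val all = big.collect() // risky: pulls every element to the driver, may OOM
val preview = big.take(100) // safer: only 100 elements travel to the driver
val total = big.count()     // safer: only a single Long comes back to the driver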
Here you can find more information about Spark components: http://spark.apache.org/docs/latest/cluster-overview.html#components

These are wonderful links describing the various parameters for tuning Spark applications, including driver memory, executor memory, etc.:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Spark Job Architecture
A Spark application consists of a single driver process and a set of executor processes scattered across nodes on the cluster.
The driver is the process that is in charge of the high-level control flow of work that needs to be done. The executor processes are responsible for executing this work, in the form of tasks, as well as for storing any data that the user chooses to cache. Both the driver and the executors typically stick around for the entire time the application is running, although dynamic resource allocation changes that for the latter. A single executor has a number of slots for running tasks, and will run many concurrently throughout its lifetime. Deploying these processes on the cluster is up to the cluster manager in use (YARN, Mesos, or Spark Standalone), but the driver and executor themselves exist in every Spark application.
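For instance (a hedged sketch; the flag values are arbitrary), dynamic allocation is typically switched on with a pair of settings at submit time, letting the cluster manager grow and shrink the executor set while the driver stays put:

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  your_app.jar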

Related

In a Spark cluster, is there one driver process per machine or one driver process per cluster?

I am reading Spark: The Definitive Guide and am learning a great deal.
However, one thing I am confused about while reading is how many driver processes there are per Spark job. When the text first introduces driver and executor processes, it implies that there is a driver per machine, but in the discussion about broadcast variables, it sounds as though there is one driver per cluster.
This is because the text talks about the driver process sending the broadcast variable to every node in the cluster so that it can be cached there. That makes it sound as though there is only one driver process in the whole cluster that is responsible for this.
Which one is it: one driver process per cluster, or one per machine? Or can it be both? I think I am missing something here.
In the Spark architecture, there will be only one driver for your Spark application.
The Spark driver, as part of the Spark application, is responsible for instantiating a Spark session. The Spark driver has multiple responsibilities:
It communicates with the cluster manager (CM).
It requests resources from the CM for Spark's executor JVMs.
It transforms all Spark operations into DAG computations, schedules them, and distributes their execution as tasks across all Spark executors.
Its interaction with the CM is merely to get Spark executor resources.
So, the workflow of running Spark applications on a cluster can be seen as (an example command follows the list):
The user submits an application using spark-submit.
spark-submit launches the driver program and invokes the main method specified by the user.
The driver program contacts the cluster manager to ask for resources to start executors.
The cluster manager launches executors on behalf of the driver program.
The driver process runs through the user application. Based on the RDD or dataset operations in the program, the driver sends work to executors in the form of tasks.
The tasks run in executor processes to compute and save results.
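A hedged example tying these steps together, reusing the flags from the original question (my_app.py is a placeholder):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-memory 2g \
  --driver-memory 4g \
  my_app.py

Here spark-submit starts the driver on the submitting machine (client mode), the driver asks YARN for 4 executors of 2 GB each, and the driver itself gets 4 GB for scheduling and for any results collected back to it.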

What is the exact difference between Spark Local and Standalone mode? [duplicate]

Can someone mention the difference regarding these factors
Number of nodes / Machines
Memory
Cores
Setup
Deployment
Advantages of each mode
When they should be used
Examples if possible
Also, if I am running Spark locally on a single laptop, is that local mode or standalone?
There is a huge difference between standalone and local.
Local means that Spark runs on your PC locally, i.e. it is not distributed.
Standalone means that Spark will handle resource management.
For standalone, I will give you some background so you can better understand what it means.
Spark is a distributed application which consumes resources, i.e. memory, CPU, and more.
Let's assume that you run 2 Spark applications at the same time; this can cause an error when allocating resources. For example, it may happen that the first job consumes all the memory and the second job fails because it doesn't have enough memory.
To resolve this issue you need to use a resource manager that will guarantee that your job can run without any resource problems.
Standalone means that Spark itself handles the management of the resources on the cluster; there are also other resource management tools like YARN or Mesos.
Overall you have 3 options for managing resources on the cluster:
Mesos, YARN, Standalone.
I would also mention that on a real Hadoop cluster, Spark is not the only application running, which means it is not the only consumer of resources. You can also run HBase, Tez, or Impala, and YARN would help you allocate resources to all of those applications.
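As a minimal sketch of how the two modes differ at the code level (the host and port are hypothetical placeholders), the only change is the master URL:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]") // local mode: one JVM on your laptop, 4 worker threads
  // .master("spark://spark-master-host:7077") // standalone mode: connect to a Spark standalone cluster manager
  .appName("ModeExample")
  .getOrCreate()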

Can someone let me know how to decide --executor-memory and --num-executors in a spark-submit job? What is the concept of --executor-cores?

How do you decide the --executor-memory and --num-executors in a spark-submit job? What is the concept of --executor-cores?
Also, what is the clear difference between cluster and client deploy mode, and how do you choose the deploy mode?
The first part of your question, where you ask about --executor-memory, --num-executors and --executor-cores, usually depends on the variety of tasks your Spark application is going to perform.
Executor memory indicates the amount of physical memory you want to allocate to the JVM that runs the executor. The value will depend on your requirements. For example, if you're just going to parse a large text file, you'll require much less memory than you need for, say, image processing.
The number of executors is the number of executor JVMs you want to spawn on your cluster. Again, it depends on a lot of factors like your cluster size, the type of machines in the cluster, etc.
Each executor splits the code and performs the instructions in tasks. These tasks are performed on executor cores (or processors). This helps you achieve parallelism within a given executor, but make sure you don't allocate all the cores of a machine to its executor, because some are needed for the machine's normal functioning.
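As a hedged worked example (the node sizes are hypothetical; the heuristics follow the Cloudera posts linked earlier): with 16 cores and 64 GB per node, reserve roughly 1 core and 1 GB for the OS and Hadoop daemons, leaving 15 cores and 63 GB. Capping executors at 5 cores each (a common heuristic for good HDFS throughput) gives 3 executors per node with 63 / 3 = 21 GB each; after subtracting about 7% for off-heap overhead (spark.executor.memoryOverhead), that becomes roughly 19 GB. On a hypothetical 4-node cluster, leaving one executor slot for the YARN ApplicationMaster, that works out to:

spark-submit --num-executors 11 --executor-cores 5 --executor-memory 19g your_app.jar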
On to the second part of your question: we have two values for --deploy-mode in Spark, which you have already named, i.e. cluster and client.
client mode is when you connect an external machine to a cluster and run a Spark job from that external machine, like when you connect your laptop to a cluster and run spark-shell from it. The driver JVM is invoked on your laptop, and the session is killed as soon as you disconnect your laptop. Similar is the case for a spark-submit job: if you run a job with --deploy-mode client, your laptop acts as the driver host, but the job is killed as soon as it is disconnected (not sure about this one).
cluster mode: when you specify --deploy-mode cluster in your job, then even if you run it from your laptop or any other machine, the job (JAR) is taken care of by the ResourceManager and ApplicationMaster, just like any other application in YARN. You won't be able to see the output on your screen, but most complex Spark jobs write to a file system anyway, so that's taken care of that way.
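As a hedged side-by-side (your_app.jar is a placeholder), the only change on the command line is the flag:

spark-submit --master yarn --deploy-mode client your_app.jar
spark-submit --master yarn --deploy-mode cluster your_app.jar

In the first command the driver JVM starts on the machine you type this on; in the second, YARN starts the driver inside the ApplicationMaster on a cluster node.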

Is there a way to understand which executors were used for a job in java/scala code?

I'm having trouble evenly distributing streaming receivers among all executors of a yarn-cluster.
I've got a yarn-cluster with 8 executors, I create 8 streaming custom receivers and spark is supposed to launch these receivers one per executor. However this doesn't happen all the time and sometimes all receivers are launched on the same executor (here's the jira bug: https://issues.apache.org/jira/browse/SPARK-10730).
So my idea is to run a dummy job, get the executors that were involved in that job and if I got all the executors, create the streaming receivers.
For doing that anyway I need to understand if there is a way to understand which executors were used for a job in java/scala code.
I believe it is possible to see which executors were doing which jobs by accessing the Spark UI and Spark logs. From the official 1.5.0 documentation (here):
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
On the Executors page of the web UI you can see which executors are active. In case there are cores/nodes that are not being used, you can detect them by looking at which cores/nodes are actually active and running.
In addition, every executor displays information about the number of tasks that are running on it.
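If you want this programmatically rather than from the UI, here is a hedged Scala sketch matching the dummy-job idea from the question (getExecutorMemoryStatus is a developer API, so treat its availability and output as an assumption about your Spark version):

// Run a dummy job so all executors register with the driver, then list them.
sc.parallelize(1 to 10000, 8).count() // dummy job touching all 8 partitions
// Keys are "host:port" strings; the driver itself also appears in the map.
val executors = sc.getExecutorMemoryStatus.keys
executors.foreach(println)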

Spark Driver in Apache Spark

I already have a cluster of 3 machines (ubuntu1, ubuntu2, ubuntu3 via VirtualBox VMs) running Hadoop 1.0.0. I installed Spark on each of these machines. ub1 is my master node and the other nodes are working as slaves. My question is: what exactly is a Spark driver? Should we set an IP and port for the Spark driver via spark.driver.host, and where will it be executed and located (master or slave)?
The spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark master. In the case of a local cluster, as in your case, the master_url = spark://<host>:<port>.
Its location is independent of the master/slaves. You could co-locate it with the master or run it from another node. The only requirement is that it must be network-addressable from the Spark workers.
This is what the configuration of your driver looks like:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("master_url") // this is where the master is specified
  .setAppName("SparkExamplesMinimal")
  .set("spark.local.ip", "xx.xx.xx.xx") // helps when multiple network interfaces are present; the driver must be in the same network as the master and slaves
  .set("spark.driver.host", "xx.xx.xx.xx") // same as above; this duality might disappear in a future version

val sc = new SparkContext(conf)
// etc...
To explain a bit more on the different roles:
The driver prepares the context and declares the operations on the data using RDD transformations and actions.
The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
The workers are where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs.
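A hedged sketch of that division of labor (assuming a live SparkContext named sc; the data is arbitrary):

// The driver declares lazy transformations; nothing executes yet.
val words = sc.parallelize(Seq("a b", "b c", "c d"))
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
// The action below makes the driver submit a job; workers run the tasks.
words.collect().foreach(println)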
Your question is related to Spark deployment on YARN; see http://spark.apache.org/docs/latest/running-on-yarn.html ("Running Spark on YARN").
Assume you start from a spark-submit --master yarn command:
The command will request the YARN ResourceManager (RM) to start an ApplicationMaster (AM) process on one of your cluster machines (those that have a YARN NodeManager installed).
Once the AM has started, it will call your driver program's main method. So the driver is actually where you define your SparkContext, your RDDs, and your jobs. The driver contains the entry main method which starts the Spark computation.
The SparkContext will prepare an RPC endpoint for the executors to talk back to, and a lot of other things (memory store, disk block manager, jetty server...).
The AM will request the RM for containers to run your Spark executors, with the driver's RPC URL (something like spark://CoarseGrainedScheduler@ip:37444) specified in the executors' start command.
In the architecture diagram from those docs, the "Spark Context" box is the driver.
A Spark driver is the process that creates and owns an instance of SparkContext. It is your Spark application that launches the main method in which the instance of SparkContext is created. It is the cockpit of job and task execution (using the DAGScheduler and TaskScheduler). It hosts the Web UI for the environment.
It splits a Spark application into tasks and schedules them to run on executors.
A driver is where the task scheduler lives and spawns tasks across workers.
A driver coordinates workers and overall execution of tasks.
In simple terms, the Spark driver is the program that contains the main method (the main method is the starting point of your program). So, in Java, the driver will be the class that contains public static void main(String[] args).
In a cluster, you can run this program in either one of the ways:
1) On any remote host machine. Here you'll have to provide the remote host machine's details while submitting the driver program to the remote host. The driver runs in a JVM process created on the remote machine and only comes back with the final result.
2) Locally from your client machine (your laptop). Here the driver program runs in a JVM process created on your machine locally. From here it sends tasks to remote hosts and waits for the result of each task.
If you set the config spark.submit.deployMode=cluster (or pass --deploy-mode cluster), then your driver will be launched on one of your worker hosts (ubuntu2 or ubuntu3).
If the deploy mode is client, which is the default value, then the driver will run on the machine from which you submit your application.
And finally, you can see your application on the web UI at http://driverhost:driver_ui_port, where driver_ui_port defaults to 4040; you can change the port by setting the config spark.ui.port.
The Spark driver is the process that allows the application to create a SparkContext; the SparkContext is the connection to compute resources.
The driver can run on the box from which the job is submitted, or it can run on one of the cluster nodes when using a resource manager like YARN.
Both the client and cluster options have trade-offs, such as:
Access to the CPU/memory of one of the cluster nodes; sometimes this is good because cluster nodes tend to be big in terms of memory.
Driver logs land on a cluster node rather than on the local box from which the job was submitted.
You should have a history server for cluster mode, otherwise driver-side logs are lost.
Sometimes it is hard to install a dependency (e.g. a native dependency) on the executors, and running the Spark application in client mode comes to the rescue.
If you want to read more about Spark job anatomy, this post could be useful: http://ashkrit.blogspot.com/2018/09/anatomy-of-apache-spark-job.html
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
(Figure: Spark cluster components)
There are several useful things to note about this architecture:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
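A hedged sketch of the configs involved in that reachability requirement (the hostname and port values are hypothetical):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.host", "driver-host.example.com") // an address the workers can reach
  .set("spark.driver.port", "51000") // fix the port (default is random) so firewalls can allow it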
