We're submitting jobs to a standalone cluster in cluster deploy mode (i.e. driver running in Spark) and would like to be able to track the progress of the jobs and instrument them.
In order to do that, we need to know the ID that Spark assigned to the driver process, but I have not seen any way to obtain that information from within the running application; it is not exposed via the SparkContext (only the application ID is, which is not the same thing).
Am I missing something or is there really no way to know the driverId from within the executing code?
I am not sure how to get the driver ID directly. However, there is one thing you can do instead. Each job, once submitted, is identified by an application ID in the YARN ResourceManager. Since in cluster mode the driver runs inside the Application Master, you may be able to track it by going through the logs of the Application Master / YARN ResourceManager for that application ID.
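For what it's worth, the application ID (though not the driver ID) is exposed from inside the running job; a minimal PySpark sketch:

# Sketch: reading the application ID (not the driver ID) from inside the app.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app-id-sketch").getOrCreate()

# The cluster manager's application ID is exposed; the driver ID itself is not,
# which is exactly the limitation described in the question.
print(spark.sparkContext.applicationId)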
Related
I couldn't figure out the difference between the Spark driver and the application master. Basically, what are their responsibilities in running an application; who does what?
In client mode, the client machine hosts the driver and the app master runs on one of the cluster nodes. In cluster mode, the client hosts neither; the driver and the app master run on the same node (one of the cluster nodes).
What exactly are the operations that the driver does and the app master does?
References:
Spark Driver memory and Application Master memory
Spark yarn cluster vs client - how to choose which one to use?
As per the Spark documentation:
Spark Driver:
The Driver (aka driver program) is responsible for converting a user application to smaller execution units called tasks and then schedules them to run with a cluster manager on executors. The driver is also responsible for executing the Spark application and returning the status/results to the user.
The Spark Driver contains various components – DAGScheduler, TaskScheduler, BackendScheduler and BlockManager. They are responsible for the translation of user code into actual Spark jobs executed on the cluster.
Whereas the Application Master is:
The Application Master is responsible for the execution of a single application. It asks for containers from the Resource Scheduler (Resource Manager) and executes specific programs on the obtained containers.
The Application Master is just a broker: it negotiates resources with the Resource Manager and, after obtaining containers, makes sure the tasks (which are picked from the scheduler queue) are launched on them.
In a nutshell, the driver program translates your custom logic into jobs, stages and tasks, while the application master makes sure enough resources are obtained from the RM and also checks the status of your tasks running in the containers.
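As a rough illustration (a hypothetical example, not from the original answer), the driver-side code below is the kind of user logic that the driver translates into jobs, stages and tasks once an action runs:

# Hypothetical driver program: transformations are only planned by the driver;
# the action at the end triggers the driver to build a job, split it into
# stages/tasks and hand them to the scheduler.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-translation-sketch").getOrCreate()

numbers = spark.range(0, 1_000_000)      # lazy: no job yet
evens = numbers.filter("id % 2 = 0")     # still lazy, just extends the plan
print(evens.count())                     # action: job -> stages -> tasks on executors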
As already said in the references you provided, the only difference between client and cluster mode is this: in client mode the driver runs on the machine where you executed the Spark application/job and the AM runs on one of the cluster nodes, whereas in cluster mode the driver runs inside the application master, which means the application master has that much more responsibility.
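As a quick sanity check (a sketch, assuming spark-submit populated the spark.submit.deployMode property), the configured deploy mode can be read back from the SparkConf at runtime:

# Sketch: inspecting the deploy mode from inside a running application.
# Assumes spark-submit set spark.submit.deployMode to "client" or "cluster".
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
deploy_mode = spark.sparkContext.getConf().get("spark.submit.deployMode", "client")
print(f"Running in {deploy_mode} deploy mode")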
References:
https://luminousmen.com/post/spark-anatomy-of-spark-application#:~:text=The%20Driver(aka%20driver%20program,status%2Fresults%20to%20the%20user.
https://www.edureka.co/community/1043/difference-between-application-master-application-manager#:~:text=The%20Application%20Master%20is%20responsible,class)%20on%20the%20obtained%20containers.
I am aware of the basics of the YARN framework; however, I still feel I lack some understanding with regards to MapReduce.
I have read that MapReduce is just one of the applications that can run on top of YARN; for example, on the same YARN cluster various kinds of jobs can run: MapReduce jobs, Spark jobs, etc.
Now, the point is that each type of job has its "own" kind of "job phases"; for example, MapReduce has phases like map, sort, shuffle and reduce.
Specific to this scenario, who "decides" and "controls" these phases? Is it the MapReduce framework?
As I understand it, YARN is an infrastructure on which different jobs run. So when we submit a MapReduce job, does it first go to the MapReduce framework, with the code then executed by YARN? I have this doubt because YARN is a general-purpose execution engine and won't have knowledge of mappers, reducers, etc., which are specific to MapReduce (and likewise for other kinds of jobs). So does the MapReduce framework run on top of YARN, with YARN helping execute the jobs and the MapReduce framework being aware of the phases it has to go through for a particular kind of job?
Any clarification to understand this would be of great help.
If you take a look at the architecture picture in the Hadoop documentation, you'll see that there is no particular "job orchestration" component, but there is a resource-requesting component called the application master. As you mentioned, YARN does resource management; with regards to application orchestration, it stops at an abstract level.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
When applied to Spark, some of the components in that picture would be:
Client: the spark-submit process
App Master: Spark's YARN application master, which runs both the driver and the application master (cluster mode) or just the application master (client mode)
Container: the Spark executors
Spark's YARN infrastructure provides the application master (in YARN terms), which knows about Spark's architecture. So when the driver runs, either in cluster mode or in client mode, it still decides on jobs/stages/tasks. This must be application/framework-specific (Spark being the "framework" when it comes to YARN).
From the Spark documentation on YARN deployment:
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
You can extend this abstraction to map-reduce, given your understanding of that framework.
So when we submit a MapReduce job, it first goes to the Resource Manager, which is the master daemon of YARN. The Resource Manager then selects a Node Manager (the slave process of YARN) and asks it to start a container, in which a very lightweight process known as the Application Master is launched. The Resource Manager then asks the Application Master to start execution of the job.
The Application Master first goes through the driver part of the job, from which it learns what resources will be needed, and accordingly it requests those resources from the Resource Manager. The Resource Manager can assign the resources to the Application Master immediately, or, if the cluster is too occupied, the request is rescheduled according to various scheduling algorithms.
After getting the resources, the Application Master goes to the Name Node to get the metadata of all the blocks that need to be processed for this job.
After receiving the metadata, the Application Master asks the Node Managers of the nodes where the blocks are stored (or, if those nodes are too busy, a node in the same rack, or otherwise any node, depending on rack awareness) to launch containers for processing their respective blocks.
The blocks are processed independently and in parallel on their respective nodes. After all the processing is done, the result is stored in HDFS.
After deploying a Spark structured streaming application, how can I obtain a Spark session on the executor, for deploying another job with the same session and the same configuration settings?
You cannot get the Spark session on the executor if you are running Spark in cluster mode, because the SparkSession object cannot be serialised and therefore cannot be sent to the executor. It is also against Spark's design principles to do so.
I may be able to help you with this if you can tell me the problem statement.
Technically you can get a Spark session on the executor, no matter which mode you are running in, but it is not really worth the effort. A Spark session is an object holding various internal Spark settings along with the other user-defined settings we provide at startup.
The only reason those configuration settings are not available on the executor is that most of them are marked as transient, which means those objects are sent as null; it does not make logical sense to send them to the executors, in the same way it does not make sense to send database connection objects from one node to another.
One of the cumbersome ways to do this would be to collect all configuration settings from the Spark session in your driver, set them in some custom object marked as serializable and send that to the executor. Your executor environment would also have to match the driver in terms of all Spark jars/directories and other Spark properties such as SPARK_HOME, which can be painful if you keep running it and realizing each time that something is missing. You would end up with a different Spark session object, but with all the same settings.
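A minimal sketch of that cumbersome idea (hypothetical names throughout): the SparkSession itself is not serializable, but its settings, copied into a plain dict, are, and that dict can travel to the executor inside a task closure:

# Sketch: the settings travel to the executor even though the session cannot.
from pyspark.sql import SparkSession

driver_spark = SparkSession.builder.appName("settings-copy-sketch").getOrCreate()

# A plain dict of (key, value) pairs is serializable, unlike SparkSession itself.
conf_dict = dict(driver_spark.sparkContext.getConf().getAll())

def inspect_settings_on_executor(_):
    # Runs inside an executor task: the settings are here, the session is not.
    # Rebuilding a session from them is the step that is "not worth the effort",
    # and newer Spark versions block creating contexts on executors by default.
    return [conf_dict.get("spark.app.name")]

print(driver_spark.sparkContext
      .parallelize([0], numSlices=1)
      .mapPartitions(inspect_settings_on_executor)
      .collect())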
The better option would be to run another Spark application with the same settings you provide for your first application, as one Spark session is associated with one Spark application.
It is not possible. I also had a similar requirement; I had to create two separate main classes and one Spark launcher class in which I called sparksession.conf.set(main class name) based on which class I wanted to run. If I wanted to run both, I used thread.sleep() to let the first one complete before launching the other. I also used SparkListener code to get the status of whether it had completed or not.
I am aware that this is a late response. Just thought this might be useful.
So, you can use something like the below code snippet in your Spark structured streaming application:
For Spark versions <= 3.2.1:
spark_session_for_this_micro_batch = microBatchOutputDF._jdf.sparkSession()
For Spark versions >= 3.3.1:
spark_session_for_this_micro_batch = microBatchOutputDF.sparkSession
Your function can use this Spark session to create dataframes there.
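For context, here is a minimal foreachBatch sketch (the rate source and noop sink are placeholders I'm assuming for illustration) showing where that per-batch session would be used:

# Sketch: using the micro-batch DataFrame's session inside foreachBatch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-batch-sketch").getOrCreate()

stream_df = spark.readStream.format("rate").load()  # placeholder streaming source

def process_batch(microBatchOutputDF, batch_id):
    # Spark >= 3.3: the session is exposed directly on the DataFrame
    # (older versions: microBatchOutputDF._jdf.sparkSession(), as above).
    batch_session = microBatchOutputDF.sparkSession
    # The session can be used to create additional DataFrames for this batch.
    extra_df = batch_session.createDataFrame([(batch_id,)], ["batch_id"])
    microBatchOutputDF.crossJoin(extra_df) \
        .write.mode("append").format("noop").save()

query = stream_df.writeStream.foreachBatch(process_batch).start()
query.awaitTermination()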
For more details, you can refer to this Medium post and the PySpark docs.
Is there a way to find a driver-to-application mapping in cluster mode?
I understand that, on submitting an application, the CreateSubmissionResponse returns the driver ID, which can be used to monitor or kill the driver program. I am trying to see if there is an alternate way of doing this without storing the driver ID.
I saw the driver UI at http://<driver>:4040, which gives the application information under the Environment section, but the Spark documentation mentions:
"If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc)."
This makes it difficult to map which driver runs on which port.
So is there a way to get all driver IDs and their applications?
Environment: Spark standalone with Zookeeper as Cluster manager.
Any help is appreciated!
Thanks
If you have the Spark history server up and running, the simplest way is to use its REST API: http://spark.apache.org/docs/latest/monitoring.html#rest-api. Check the /applications endpoint.
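For example, a small sketch (the host and port are assumptions; 18080 is the history server's default):

# Sketch: listing applications known to the Spark history server via its REST API.
import json
import urllib.request

HISTORY_SERVER = "http://localhost:18080"  # assumed host/port

with urllib.request.urlopen(f"{HISTORY_SERVER}/api/v1/applications") as resp:
    applications = json.load(resp)

for app in applications:
    # Each entry carries the application ID and name, among other fields.
    print(app["id"], app["name"])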
I am trying to implement a simple Spark SQL application that takes a query as input and processes the data. But because I need to cache the data, I have to maintain a single SQLContext object. I am not able to understand how I can use the same SQL context and keep receiving queries from users.
So how does an application work? When an application is submitted to the cluster, does it keep running on the cluster, or does it perform a specific task and shut down immediately afterwards?
A Spark application has a driver program that starts and configures the Spark Context. The driver program can be inside your application, and you can use the same Spark Context throughout the life of your application.
Spark Context is thread safe, so multiple users can use it to run jobs concurrently.
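As a sketch of that long-running pattern (hypothetical table, path and queries, not from the answer): a single shared session serving user queries against cached data from a thread pool:

# Sketch: one long-running application, one shared SparkSession, many queries.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("long-running-sql-service").getOrCreate()

# Load and cache the data once; register it for SQL queries.
data = spark.read.parquet("/data/events")   # placeholder path
data.cache()
data.createOrReplaceTempView("events")

def run_query(sql_text):
    # The same SparkSession/SparkContext is reused; it is thread-safe.
    return spark.sql(sql_text).collect()

queries = [
    "SELECT COUNT(*) FROM events",
    "SELECT MAX(event_time) FROM events",   # assumes an event_time column
]

with ThreadPoolExecutor(max_workers=4) as pool:
    for rows in pool.map(run_query, queries):
        print(rows)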
There is an open-source project, Apache Zeppelin, that does just that.