PySpark Context Listeners - apache-spark

I am trying to use DI (Dependency Injection) in my code, but the nature of Spark's driver vs. executor split is making things tough. DI tries to make sure that objects are created during bootstrap, which works well for the driver but not for the executors. So I am planning to bootstrap the objects for each executor as part of its creation or loading. I am looking for a listener that lets me listen to Spark context events on the executor side; I have looked into the different listeners, but none of them is useful for me. In case I have missed or overlooked something: is there a listener for the Spark context or for executors?
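To make the per-executor bootstrap idea concrete, this is roughly what I have in mind (build_container and its handle method are placeholders for my actual DI wiring), with the container built lazily once per executor worker process:

_container = None

def build_container():
    # Placeholder for the real DI bootstrap; returns whatever object graph the executor needs.
    class Container:
        def handle(self, row):
            return row
    return Container()

def get_container():
    # Lazily bootstrap the container the first time a task runs in this worker process.
    global _container
    if _container is None:
        _container = build_container()
    return _container

def process_partition(rows):
    container = get_container()
    for row in rows:
        yield container.handle(row)

# e.g. rdd.mapPartitions(process_partition) triggers the bootstrap inside each executor.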

Related

Get Spark session on executor

After deploying a Spark Structured Streaming application, how can I obtain a Spark session on the executor, in order to deploy another job with the same session and the same configuration settings?
You cannot get the Spark session onto an executor if you are running Spark in cluster mode, because the SparkSession object cannot be serialised and therefore cannot be sent to the executors. It is also against Spark's design principles to do so.
I may be able to help you with this if you can tell me the problem statement.
Technically you can get a Spark session on the executor no matter which mode you run in, but it is not really worth the effort. A SparkSession is an object holding various internal Spark settings along with the other user-defined settings we provide on startup.
The only reason those configuration settings are not available on the executor is that most of them are marked as transient, which means they are sent as null: it does not make logical sense to ship them to the executors, in the same way that it does not make sense to send database connection objects from one node to another.
One of the more cumbersome ways to do this would be to read all the configuration settings from the Spark session in your driver, set them on some custom object marked as serializable, and send that to the executor. Your executor environment would also have to match the driver in terms of all Spark jars/directories and other Spark properties such as SPARK_HOME, which can be hectic if every time you run you realise something is still missing. The result would be a different SparkSession object, but with all the same settings.
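A rough PySpark sketch of the driver-side half of that idea (only the plain, picklable settings container; rebuilding a session from it on the executor is the cumbersome part):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Copy the driver's configuration into a plain dict, which is picklable
# and can therefore be captured in a task closure and shipped to executors.
conf_dict = dict(spark.sparkContext.getConf().getAll())

def use_conf_on_executor(rows):
    # conf_dict travels with the closure; the transient session object does not.
    for row in rows:
        yield (row, conf_dict.get("spark.app.name"))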
The better option would be to run another Spark application with the same settings you provide for your first application, since one Spark session is associated with one Spark application.
It is not possible. I had a similar requirement: I had to create two separate main classes and one Spark launcher class, in which I called sparksession.conf.set(main class name) based on which class I wanted to run. If I wanted to run both, I used Thread.sleep() so the first one completed before launching the other. I also used SparkListener code to get the status of whether it had completed or not.
I am aware that this is a late response. Just thought this might be useful.
So, you can use something like the code snippet below in your Spark Structured Streaming application:
For Spark versions <= 3.2.1:
spark_session_for_this_micro_batch = microBatchOutputDF._jdf.sparkSession()
For Spark versions >= 3.3.1:
spark_session_for_this_micro_batch = microBatchOutputDF.sparkSession
Your function can use this Spark session to create DataFrames there.
You can refer to this Medium post
and the pyspark doc
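For example, a minimal foreachBatch sketch along those lines, using the Spark >= 3.3 property (the rate source, noop sink and column names are just for illustration, and an existing driver-side spark session is assumed):

from pyspark.sql import functions as F

def process_batch(micro_batch_df, batch_id):
    # Spark >= 3.3: the micro-batch DataFrame exposes its session directly.
    batch_spark = micro_batch_df.sparkSession
    # That session can be used to create other DataFrames inside the batch function.
    lookup_df = batch_spark.createDataFrame([(0, "even"), (1, "odd")], ["id", "label"])
    micro_batch_df.join(lookup_df, "id", "left").write.format("noop").mode("append").save()

query = (
    spark.readStream.format("rate").load()
    .withColumn("id", F.col("value") % 2)
    .writeStream.foreachBatch(process_batch)
    .start()
)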

SparkContext.getOrCreate() purpose

What is the purpose of the getOrCreate method from the SparkContext class? I don't understand when we should use this method.
If I have 2 Spark applications that are run with spark-submit, and in the main method I instantiate the Spark context with SparkContext.getOrCreate, will both apps have the same context?
Or is the purpose simpler: when I create a Spark app and I don't want to pass the Spark context as a parameter to a method, I can just get it as a singleton object?
If I have 2 Spark applications that are run with spark-submit, and in the main method I instantiate the Spark context with SparkContext.getOrCreate, will both apps have the same context?
No, SparkContext is a local object. It is not shared between applications.
When I create a Spark app and I don't want to pass the Spark context as a parameter to a method, I can just get it as a singleton object?
This is exactly the reason. SparkContext (or SparkSession) is ubiquitous in Spark applications and in Spark's core source, and passing it around would be a huge burden.
It is also useful for multithreaded applications where an arbitrary thread can initialize the context.
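As a small, self-contained illustration of that singleton-style usage (the helper name and the local master are just for this demo):

from pyspark import SparkConf, SparkContext

def word_lengths(words):
    # No context is passed in: getOrCreate() returns the already-running
    # SparkContext if there is one, or creates it on first use.
    sc = SparkContext.getOrCreate(SparkConf().setAppName("getOrCreate-demo").setMaster("local[2]"))
    return sc.parallelize(words).map(len).collect()

print(word_lengths(["spark", "context"]))   # [5, 7]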
About the docs:
This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.
The driver runs in its own JVM, and there is no built-in mechanism for sharing it between multiple full-fledged Java applications (each a proper application executing its own main; see Is there one JVM per Java application? and Why have one JVM per application? for related general questions). "Application" here refers to a "logical application" where multiple modules execute their own code - one example is a SparkJob on spark-jobserver. That scenario is no different from passing the SparkContext to a function.

How to display step-by-step execution of a sequence of statements in a Spark application?

I have an Apache Spark data loading and transformation application built with pyspark.sql that runs for half an hour before throwing an AttributeError or other run-time exceptions.
I want to test my application end-to-end with a small data sample, something like Apache Pig's ILLUSTRATE. Sampling down the data does not help much. Is there a simple way to do this?
It sounds like an idea that could easily be handled by a SparkListener. It gives you access to all the low-level details that the web UI of any Spark application could ever show you. All the events flying between the driver (namely DAGScheduler and TaskScheduler with SchedulerBackend) and the executors are posted to registered SparkListeners, too.
A Spark listener is an implementation of the SparkListener developer API (an extension of SparkListenerInterface in which all the callback methods are no-ops).
Spark uses Spark listeners for the web UI, event persistence (for the Spark History Server), dynamic allocation of executors and other services.
You can develop your own custom Spark listeners and register them using the SparkContext.addSparkListener method or the spark.extraListeners setting.
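For instance, a minimal PySpark sketch using the spark.extraListeners route; com.example.MyMetricsListener and the jar path are hypothetical, and that class would have to extend org.apache.spark.scheduler.SparkListener and be on the driver classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("listener-demo")
    # Hypothetical jar containing the compiled listener class.
    .config("spark.jars", "/path/to/my-listeners.jar")
    # Spark instantiates the named class and registers it at startup.
    .config("spark.extraListeners", "com.example.MyMetricsListener")
    .getOrCreate()
)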
Go to the Spark UI of your job and you will find a DAG Visualization there. That's a graph representing your job.
To test your job on a sample, first of all use a sample as the input ;) You can also run Spark locally rather than on a cluster, and then debug it in the IDE of your choice (like IDEA).
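A minimal sketch of that local, sampled-input approach (the input path, column name and sample fraction are made up):

from pyspark.sql import SparkSession

# Run the whole pipeline on a local master so you can step through the driver code in a debugger.
spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("sampled-debug-run")
    .getOrCreate()
)

raw_df = spark.read.json("data/events.json")          # hypothetical input
sample_df = raw_df.sample(fraction=0.01, seed=42).cache()

# Inspect intermediate results after each transformation instead of only at the end.
step1 = sample_df.filter("value IS NOT NULL")
step1.show(5, truncate=False)
step1.explain()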
More info:
This great answer explaining DAG
DAG introduction from DataBricks

How to create a custom Apache Spark scheduler?

I have a p2p mesh network of nodes. It has its own balancing, and given a task T it can reliably execute it (if one node fails, another will continue). My mesh network has Java and Python APIs. What steps are needed to make Spark call my API to launch tasks?
Oh boy, that's a really broad question, but I agree with Daniel. If you really want to do this, you could first start with:
Scheduler Backends, which states things like:
Being a scheduler backend in Spark assumes an Apache Mesos-like model in which "an application" gets resource offers as machines become available and can launch tasks on them. Once a scheduler backend obtains the resource allocation, it can start executors.
TaskScheduler, since you need to understand how tasks are meant to be scheduled to build a scheduler, which mentions things like this:
A TaskScheduler gets sets of tasks (as TaskSets) submitted to it from the DAGScheduler for each stage, and is responsible for sending the tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers.
An important concept here is the Directed Acyclic Graph (DAG); you can take a look at its GitHub page.
You can also read What is the difference between FAILED AND ERROR in spark application states to get an intuition.
Spark Listeners — Intercepting Events from Spark can also come in handy:
Spark Listeners intercept events from the Spark scheduler that are emitted over the course of execution of Spark applications.
You could first take the Exercise: Developing Custom SparkListener to monitor DAGScheduler in Scala to check your understanding.
In general, Mastering Apache Spark 2.0 seems to have plenty of resources, but I will not list more here.
Then you have to meet the Final Boss in this game, Spark's Scheduler GitHub page, and get a feel for it. Hopefully, all this will be enough to get you started! :)
Take a look at how existing schedulers (YARN and Mesos) are implemented.
Implement the scheduler for your system.
Contribute your changes to the Apache Spark project.

Programmatically access list of live Spark nodes

I've implemented a custom data layer on Spark that has Spark nodes persisting some data locally and announcing that persistence to the Spark master. This works great by running some custom code we've written on each Spark node and on the master, but now I'd like to implement a replication protocol across my cluster. What I'd like to build is this: once the master gets a message from a node saying it has persisted data, the master can randomly select two other nodes and have them persist the same data.
I've been digging through the docs, but I don't see an obvious way for the SparkContext to give me a list of live nodes. Am I missing something?
There isn't a public API for doing this. However, you could use the Developer API SparkListener (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener). You can create a custom SparkListener class and add it to the SparkContext as
sc.addSparkListener(yourListener)
The system will call onBlockManagerAdded and onBlockManagerRemoved when a BlockManager gets added or removed, and from the BlockManager's ID, I believe you can get the URL of the nodes running the live Spark executors (which run the BlockManagers).
I agree that this is a little hacky. :)
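If a point-in-time list is enough (rather than add/remove events), another hacky PySpark sketch goes through the JVM status tracker; note that it relies on the non-public _jsc handle:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# SparkStatusTracker.getExecutorInfos() reports the executors Spark currently knows about;
# going through sc._jsc means this relies on a non-public handle to the JVM SparkContext.
infos = sc._jsc.sc().statusTracker().getExecutorInfos()
live_hosts = [info.host() for info in infos]   # the list also includes the driver's own entry
print(live_hosts)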

Resources