In Databricks Spark, how is SparkContext shared across different processes? - apache-spark

In https://docs.databricks.com/workflows/jobs/jobs.html#use-the-shared-sparkcontext it says:
Because Databricks initializes the SparkContext, programs that invoke
new SparkContext() will fail. To get the SparkContext, use only the
shared SparkContext created by Databricks:
val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()
In SparkContext#getOrCreate it says:
This function may be used to get or instantiate a SparkContext and
register it as a singleton object. Because we can only have one active
SparkContext per JVM, this is useful when applications may wish to
share a SparkContext. This method allows not passing a SparkConf
(useful if just retrieving).
In SparkSession.Builder#getOrCreate it says:
Gets an existing SparkSession or, if there is no existing one, creates
a new one based on the options set in this builder. This method first
checks whether there is a valid thread-local SparkSession, and if yes,
return that one. It then checks whether there is a valid global
default SparkSession, and if yes, return that one. If no valid global
default SparkSession exists, the method creates a new SparkSession and
assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the non-static config
options specified in this builder will be applied to the existing
SparkSession.
So my understanding is that Databricks somehow creates a SparkContext in some process, probably a JVM, and then executes the submitted JAR in a different JVM process. Is this understanding correct?
If it is, then how does the SparkContext sharing mechanism work across multiple processes?
If not, then what actually happens and how does SparkContext get shared?
Thanks

On Databricks, the SparkContext/SparkSession are created when the cluster is starting, and your submitted jar is then executed in the same JVM where the SparkContext/SparkSession was created. The recommendation about not stopping the SparkContext is especially important when you are submitting a job to an interactive cluster (which is not recommended for multiple reasons).
When you're using Python or R, you get separate Python/R processes, but they will use the same SparkContext.
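A minimal sketch of what "use only the shared SparkContext" looks like from user code. It is runnable locally with the local[*] master set below; on Databricks you would call getOrCreate() with no conf and simply get back the context the cluster already created:
from pyspark import SparkConf, SparkContext

# Local stand-in for the cluster-provided context; on Databricks this conf is unnecessary.
conf = SparkConf().setMaster("local[*]").setAppName("shared-context-demo")

sc1 = SparkContext.getOrCreate(conf)
sc2 = SparkContext.getOrCreate()   # returns the same object instead of building a new one
print(sc1 is sc2)                  # True -- only one active SparkContext per process

try:
    SparkContext(conf=conf)        # a second constructor call is rejected
except Exception as e:
    print(e)                       # complains that a SparkContext is already running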

Related

Why don't I need to create a SparkSession in Databricks?

Why don't I need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured? Or did somebody else do it for me?
That is done only in notebooks, to simplify the user's work and avoid making them specify different parameters, many of which won't have any effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark - both of them initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from a jar or Python wheel as a job, then it's your responsibility to create the corresponding objects.
In the Databricks environment: in Spark 2.0, the same effects can be achieved through SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, as they're encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts. (ref: link)
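Following up on the jar / wheel point above, here is a hypothetical sketch of a Python-wheel entry point (the module layout and app name are made up). Run as a Databricks job, the builder simply returns the session the cluster already started; run via spark-submit elsewhere, the same call creates a fresh one:
from pyspark.sql import SparkSession

def main():
    # On Databricks this returns the existing session; outside Databricks it
    # creates one (the master is supplied by spark-submit or the job runtime).
    spark = SparkSession.builder.appName("my-wheel-job").getOrCreate()
    spark.sql("SELECT 1 AS ok").show()

if __name__ == "__main__":
    main()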

Use case of spark.executor.allowSparkContext

I'm looking into spark-core and I found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find details in the official Spark documentation.
In the code, there is a short description for this config:
If set to true, SparkContext can be created in executors.
But I wonder: how can a SparkContext be created in executors? As far as I know, a SparkContext is created on the driver, and executors are assigned by the resource manager, so a SparkContext is always created before the executors.
What is the use case of this config?
From the Spark Core migration 3.0 to 3.1:
In Spark 3.0 and below, SparkContext can be created in executors.
Since Spark 3.1, an exception will be thrown when creating
SparkContext in executors. You can allow it by setting the
configuration spark.executor.allowSparkContext when creating
SparkContext in executors.
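A minimal sketch of that escape hatch, assuming you really do need the pre-3.1 behaviour back. The flag is read from the SparkConf passed to the SparkContext constructor, so it only affects the context being created; on the driver the call is allowed either way:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("allow-spark-context-demo")
        # Opt back into the pre-3.1 behaviour (generally discouraged).
        .set("spark.executor.allowSparkContext", "true"))
sc = SparkContext.getOrCreate(conf)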
As per the issue SPARK-32160, since version 3.1 there is a check added when creating a SparkContext (for PySpark, see pyspark/context.py) which prevents executors from creating a SparkContext:
if (conf is None or
        conf.get("spark.executor.allowSparkContext", "false").lower() != "true"):
    # In order to prevent SparkContext from being created in executors.
    SparkContext._assert_on_driver()
# ...

@staticmethod
def _assert_on_driver():
    """
    Called to ensure that SparkContext is created only on the Driver.
    Throws an exception if a SparkContext is about to be created in executors.
    """
    if TaskContext.get() is not None:
        raise Exception("SparkContext should only be created and accessed on the driver.")
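A small runnable sketch (local mode) of the mechanism the quoted check relies on: TaskContext.get() returns None on the driver and a real TaskContext inside a task, which is exactly how _assert_on_driver() tells the two sides apart:
from pyspark import SparkConf, SparkContext, TaskContext

sc = SparkContext.getOrCreate(
    SparkConf().setMaster("local[2]").setAppName("taskcontext-demo"))

print(TaskContext.get() is None)    # True -- driver side, no task context
flags = (sc.parallelize(range(4), 2)
           .map(lambda _: TaskContext.get() is not None)
           .collect())
print(flags)                        # [True, True, True, True] -- task side

sc.stop()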
I would suggest this is an error in the docs and/or the implementation.
The whole concept makes no sense if you (as you do) understand the Spark architecture, and no announcement has been made stating otherwise.
From the other answer and the plentiful documented errors on this aspect, it is clear something went awry.

Get instance of Azure data bricks Spark in Python code

I am developing a Python package which will be deployed to a Databricks cluster. We often need a reference to the "spark" and "dbutils" objects within the Python code.
We can access these objects easily within a notebook using "spark" (like spark.sql()). How do we get the Spark instance within the Python code in the package?
SparkSession.Builder.getOrCreate:
Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
This method first checks whether there is a valid global default SparkSession, and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default
So whenever you need an instance of SparkSession and don't want to pass it as an argument:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
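The question also asks about dbutils. Outside notebooks, a common pattern on Databricks is to build it from the session; a hedged sketch, assuming the code actually runs on a Databricks cluster where the pyspark.dbutils module ships with the runtime:
from pyspark.sql import SparkSession

def get_dbutils(spark: SparkSession):
    # Import lazily: pyspark.dbutils only exists on Databricks Runtime.
    from pyspark.dbutils import DBUtils
    return DBUtils(spark)

spark = SparkSession.builder.getOrCreate()
dbutils = get_dbutils(spark)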

How to use the getOrCreate() method in the SparkContext class, and what exactly is the functionality we achieve from this method?

What is the use of the getOrCreate() method in the SparkContext class, and how can I use it? I did not find any suitable (coding-wise) example of this.
What I understand is that using the above method I can share the Spark context between applications. What do we mean by applications here?
Is an application a different job submitted to a Spark cluster?
If so, then should we be able to use global variables (broadcast) and temp tables registered in one application in another application?
Please, can anyone elaborate and give a suitable example of this?
As given in the Javadoc for SparkContext, getOrCreate() is useful when applications may wish to share a SparkContext. So yes, you can use it to share a SparkContext object across applications. And yes, you can re-use broadcast variables and temp tables across them.
As for understanding Spark applications, please refer to this link. In short, an application is the highest-level unit of computation in Spark, and what you submit to a Spark cluster is not a job but an application. Invoking an action inside a Spark application triggers the launch of a job to fulfill it.
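A minimal sketch of that sharing within a single driver process: two functions each call getOrCreate(), and only the first call actually constructs a context (the local[*] master is only there so the sketch runs outside a cluster):
from pyspark import SparkConf, SparkContext

def job_a():
    # First call: builds the context.
    sc = SparkContext.getOrCreate(
        SparkConf().setMaster("local[*]").setAppName("shared-context"))
    return sc.parallelize(range(5)).sum()

def job_b():
    # Second call: just returns the existing context.
    sc = SparkContext.getOrCreate()
    return sc.parallelize(range(5)).count()

print(job_a(), job_b())   # 10 5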
getOrCreate
public SparkSession getOrCreate()
Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
This method first checks whether there is a valid thread-local SparkSession and if yes, return that one. It then checks whether there is a valid global default SparkSession and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.
Please check the link: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
An example can be:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In Apache Spark SQL, how to close the metastore connection from HiveContext

My project has unit tests for different HiveContext configurations (sometimes they are in one file, as they are grouped by feature).
After upgrading to Spark 1.4 I encounter a lot of 'java.sql.SQLException: Another instance of Derby may have already booted the database' problems, as a patch makes those contexts unable to share the same metastore. Since it's not clean to revert the state of a singleton for every test, my only option boils down to "recycling" each context by terminating the previous Derby metastore connection. Is there a way to do this?
Well, in Scala I just used FunSuite for unit tests together with the BeforeAndAfterAll trait. Then you can just init your sparkContext in beforeAll, spawn your HiveContext from it and finish it like this:
override def afterAll(): Unit = {
  if (sparkContext != null) {
    sparkContext.stop()
  }
}
From what I've noticed it also closes a HiveContext attached to it.
