Use case of spark.executor.allowSparkContext - apache-spark

I'm looking into spark-core and found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find any details in the official Spark documentation.
In the code there is a short description for this config:
If set to true, SparkContext can be created in executors.
But I wonder: how can a SparkContext be created in executors? As far as I know, the SparkContext is created on the driver and executors are assigned by the resource manager, so the SparkContext is always created before the executors.
What is the use case of this config?

From the Spark Core migration guide (3.0 to 3.1):
In Spark 3.0 and below, SparkContext can be created in executors.
Since Spark 3.1, an exception will be thrown when creating
SparkContext in executors. You can allow it by setting the
configuration spark.executor.allowSparkContext when creating
SparkContext in executors.
As per the issue SPARK-32160, since version 3.1 a check is performed when creating a SparkContext (see pyspark/context.py for PySpark) which prevents executors from creating a SparkContext:
if (conf is None or
        conf.get("spark.executor.allowSparkContext", "false").lower() != "true"):
    # In order to prevent SparkContext from being created in executors.
    SparkContext._assert_on_driver()

# ...

@staticmethod
def _assert_on_driver():
    """
    Called to ensure that SparkContext is created only on the Driver.

    Throws an exception if a SparkContext is about to be created in executors.
    """
    if TaskContext.get() is not None:
        raise Exception("SparkContext should only be created and accessed on the driver.")

I suggest this is an error in the docs and/or the implementation.
The whole concept makes no sense if you (as you do) understand the Spark architecture, and no announcement has been made stating otherwise.
From the other answer and the plentiful documentation errors on this aspect, it is clear something went awry.

Related

In Databricks Spark, how is SparkContext shared across different processes?

In https://docs.databricks.com/workflows/jobs/jobs.html#use-the-shared-sparkcontext it says:
Because Databricks initializes the SparkContext, programs that invoke
new SparkContext() will fail. To get the SparkContext, use only the
shared SparkContext created by Databricks:
val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()
In SparkContext#getOrCreate it says:
This function may be used to get or instantiate a SparkContext and
register it as a singleton object. Because we can only have one active
SparkContext per JVM, this is useful when applications may wish to
share a SparkContext. This method allows not passing a SparkConf
(useful if just retrieving).
In SparkSession.Builder#getOrCreate it says:
Gets an existing SparkSession or, if there is no existing one, creates
a new one based on the options set in this builder. This method first
checks whether there is a valid thread-local SparkSession, and if yes,
return that one. It then checks whether there is a valid global
default SparkSession, and if yes, return that one. If no valid global
default SparkSession exists, the method creates a new SparkSession and
assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the non-static config
options specified in this builder will be applied to the existing
SparkSession.
So my understanding is that Databricks somehow creates a SparkContext in some process, probably a JVM, and then executes the submitted JAR in a different JVM process. Is this understanding correct?
If it is, then how does the SparkContext sharing mechanism work across multiple processes?
If not, then what actually happens and how does SparkContext get shared?
Thanks
On Databricks, the SparkContext/SparkSession are created when the cluster is starting, and your submitted jar is then executed in the same JVM where the SparkContext/SparkSession was created. The recommendation about not stopping the SparkContext is especially important when you are submitting jobs to an interactive cluster (which is not recommended for multiple reasons).
When you're using Python or R, you get separate Python/R processes, but they will use the same SparkContext.
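As a small illustration of that sharing, here is a PySpark sketch of the same getOrCreate pattern shown above in Scala (the asserts and print are mine). Repeated getOrCreate calls in the same process hand back the one driver-side instance rather than constructing a new one:

from pyspark import SparkContext
from pyspark.sql import SparkSession

# On Databricks these return the instances created at cluster start;
# outside Databricks (e.g. under spark-submit) the first call creates them.
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Calling getOrCreate again yields the very same objects, which is what
# makes sharing across notebooks and jobs in one JVM possible.
assert sc is SparkContext.getOrCreate()
assert spark is SparkSession.builder.getOrCreate()
assert spark.sparkContext is sc

print(sc.applicationId)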

Why don't I need to create a SparkSession in Databricks?

Why don't I need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured? Or did somebody else do it for me?
That is done only in notebooks, to simplify the user's work and avoid making them specify parameters, many of which would have no effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark - both of them initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from a jar or Python wheel as a job, then it's your responsibility to create the corresponding objects.
In the Databricks environment, as in Spark 2.0 generally, the same effects can be achieved through SparkSession without explicitly creating SparkConf, SparkContext or SQLContext, as they're encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts. (ref: link)
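To make that concrete, here is a minimal sketch of what a submitted Python job would do itself (the app name and config value are placeholders): build the SparkSession through the builder, which also creates the underlying SparkContext:

from pyspark.sql import SparkSession

# In a submitted job there is no pre-created `spark` or `sc`, so build the
# session yourself; the builder creates the underlying SparkContext too.
spark = (SparkSession.builder
         .appName("my-batch-job")                        # placeholder name
         .config("spark.sql.shuffle.partitions", "200")  # example setting
         .getOrCreate())

sc = spark.sparkContext   # the encapsulated SparkContext

print(spark.range(10).count())

spark.stop()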

Can't find a proper way to close a Java Spark session (with Spark version 2.2.1)

I'm deploying a Spark-on-YARN driver Java application; it submits Spark jobs (mainly doing some offline statistics over Hive, Elasticsearch and HBase) to the cluster when the task scheduling system gives it a task call. So I keep this driver app running, always waiting for requests.
I use a thread pool to handle task calls. Every task opens a new SparkSession and closes it when the job finishes (we skip the scenario of multiple tasks being called at the same time to simplify this question). The Java code looks like this:
SparkSession sparkSession = SparkSession.builder()
        .config(new SparkConf().setAppName(appName))
        .enableHiveSupport()
        .getOrCreate();
// ...doing statistics...
sparkSession.close();
This app is compiled and runs under JDK 8, and the memory-related settings are configured as follows:
spark.ui.enabled=false
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.driver.memory=2G
--driver-java-options "-XX:MaxDirectMemorySize=2048M -XX:+UseG1GC"
At first glance I thought this driver app would consume at most 4G of memory, but as it keeps running, top shows its resident size growing larger and larger.
I dumped its heap and saw many Spark-related instances left in thread locals after the SparkSession was closed, such as the Hive metastore and the SparkSession itself. After much study, I found that Spark uses a lot of thread locals and doesn't remove them (or I just haven't found the right way to close a SparkSession). I added this code to clear the thread locals that Spark left behind:
import org.apache.hadoop.hive.ql.metadata.Hive;
import org.apache.hadoop.hive.ql.session.SessionState;
......
// Clear the session references and the Hive thread locals that
// SparkSession.close() leaves behind in this thread.
SparkSession.clearDefaultSession();
sparkSession.close();
Hive.closeCurrent();            // closes the thread-local Hive client
SessionState.detachSession();   // detaches the thread-local Hive SessionState
SparkSession.clearActiveSession();
This seems to work for now, but I don't think it's clean enough. I'm wondering whether there is a better way to do it, for example a single Spark Java API that does all the cleaning? I just can't find a clue in the Spark documentation.

Build a SparkSession

I have Spark as an interpreter in Zeppelin.
I'm using Spark 2.0 and I built a session.
In general you should not initialize a SparkSession or SparkContext in Zeppelin. Zeppelin notebooks are configured to create the session for you, and their correct behavior depends on using the provided objects.
Initializing your own SparkSession will break core Zeppelin functionality, and multiple SparkContexts can break things completely in the worst-case scenario.
Is setting spark.driver.allowMultipleContexts to false the best way to run tests?
You should never use spark.driver.allowMultipleContexts - it is not supported and doesn't guarantee correct results.
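Concretely, inside a Zeppelin paragraph you just use the injected objects; a minimal sketch for a %pyspark paragraph (assuming the standard Spark interpreter setup):

# In a %pyspark paragraph - Zeppelin's Spark interpreter injects these objects.
print(sc.version)                # the provided SparkContext
print(spark.range(5).count())    # the provided SparkSession (Spark 2.x+)

# Need different settings? Change them in the interpreter configuration
# instead of calling SparkSession.builder or SparkContext(...) in the note.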

stop all existing spark contexts

I am trying to create a new Spark context using pyspark, and I get the following:
WARN SparkContext: Another SparkContext is being constructed (or threw
an exception in its constructor). This may indicate an error, since
only one SparkContext may be running in this JVM (see SPARK-2243). The
other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
I do not have any other context active (in theory), but maybe it did not finish correctly and is still there. How can I find out whether there is another one, or kill all the current ones? I am using Spark 1.5.1.
When you run the pyspark shell and execute a Python script inside it, e.g. using execfile(), a SparkContext is already available as sc and a HiveContext as sqlContext. To run a Python script without any pre-created contexts, just use ./bin/spark-submit 'your_python_file'.
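A minimal sketch of handling this from a script (assuming PySpark; the app name is a placeholder): stop the shell-provided context if it exists, then build a fresh one, since only one SparkContext may be active per JVM:

from pyspark import SparkConf, SparkContext

# If this runs inside the pyspark shell, a context already exists as `sc`;
# stop it before constructing another one.
try:
    sc.stop()
except NameError:
    pass  # no shell-provided context in this process

sc = SparkContext(conf=SparkConf().setAppName("my-app"))
print(sc.version)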

Resources