stop all existing spark contexts - apache-spark

I am trying to create a new Spark context using PySpark, and I get the following:
WARN SparkContext: Another SparkContext is being constructed (or threw
an exception in its constructor). This may indicate an error, since
only one SparkContext may be running in this JVM (see SPARK-2243). The
other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
I do not have any other context active (in theory), but maybe one did not finish correctly and is still there. How can I find out whether another one exists, or kill all the current ones? I am using Spark 1.5.1.

When you run the pyspark shell and execute a Python script inside it (e.g., using execfile()), a SparkContext is already available as sc and a HiveContext as sqlContext. To run a Python script without any pre-created contexts, use ./bin/spark-submit 'your_python_file' instead.
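If a previous context did not shut down cleanly, one approach (a minimal sketch for the PySpark API; the app name below is just an example) is to stop whatever context is still registered and then build a fresh one:

from pyspark import SparkContext, SparkConf

# Stop the context left behind by the shell or an earlier run, if any.
try:
    sc.stop()          # 'sc' is the pre-created shell context, if it exists
except NameError:
    pass               # no context was defined in this session

# Now it is safe to construct a new one.
conf = SparkConf().setAppName("fresh-context")   # example app name
sc = SparkContext(conf=conf)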

Related

Creation of SparkContext in PySpark

I am wondering which process creates the SparkContext in PySpark applications. I understand that there is a Python main process and a JVM one, and it sounds like the Python main script spawns the JVM one, but I am not sure that is accurate. I have two questions:
Which one actually creates the SparkContext? I am guessing it is the JVM one, and that it is then passed to the Python main process?
If I run my PySpark application via spark-submit, it looks like a JVM process is launched. Would this be the one creating the SparkContext?
Overall it is not clear to me whether there are any differences between running PySpark code with spark-submit and with the Python interpreter (python3 my_pyspark.py, for example) in terms of which process creates the SparkContext.
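For what it's worth, a small sketch of how this can be inspected from the PySpark driver (the _jsc and _gateway attributes are internal and may change between versions):

from pyspark import SparkContext

sc = SparkContext(appName="who-creates-what")   # example app name
# The Python driver process constructs the SparkContext; under the hood it
# starts (or connects to) a JVM gateway via py4j and builds a Java-side
# JavaSparkContext, which it keeps as a proxy object.
print(type(sc._jsc))       # py4j proxy to org.apache.spark.api.java.JavaSparkContext
print(sc._gateway)         # the py4j JavaGateway bridging Python and the JVM
sc.stop()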

What is interpreter-aware SparkContext

In the spark docs, there is mentioned a line
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc.
reference : https://spark.apache.org/docs/latest/rdd-programming-guide.html
What does an interpreter-aware SparkContext mean here?
You can run the Spark shell with Python or Scala, and the Spark context knows which one it is serving. As the shell is interactive, there is an interpreter involved. It's a common computing concept, that is all.
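As a small illustration (PySpark shell; nothing here needs to be created by hand), the interactive shell hands you the ready-made context:

# Inside the interactive pyspark shell, 'sc' already exists; just use it.
# Constructing SparkContext() again would hit the "only one SparkContext
# per JVM" check mentioned above.
print(sc.master, sc.appName)
print(sc.parallelize(range(5)).sum())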

Use case of spark.executor.allowSparkContext

I'm looking into spark-core and found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find details in the official Spark documentation.
In the code there is a short description of this config:
If set to true, SparkContext can be created in executors.
But I wonder: how can a SparkContext be created in executors? As far as I know, the SparkContext is created on the driver and executors are assigned by the resource manager, so the SparkContext is always created before the executors.
What is the use case of this config?
From the Spark Core migration guide for 3.0 to 3.1:
In Spark 3.0 and below, SparkContext can be created in executors.
Since Spark 3.1, an exception will be thrown when creating
SparkContext in executors. You can allow it by setting the
configuration spark.executor.allowSparkContext when creating
SparkContext in executors.
As per SPARK-32160, since version 3.1 there is a check added when creating a SparkContext (for PySpark, see pyspark/context.py) which prevents executors from creating a SparkContext:
if (conf is None or
        conf.get("spark.executor.allowSparkContext", "false").lower() != "true"):
    # In order to prevent SparkContext from being created in executors.
    SparkContext._assert_on_driver()
# ...
@staticmethod
def _assert_on_driver():
    """
    Called to ensure that SparkContext is created only on the Driver.

    Throws an exception if a SparkContext is about to be created in executors.
    """
    if TaskContext.get() is not None:
        raise Exception("SparkContext should only be created and accessed on the driver.")
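A hedged illustration of what that guard means in practice (function and app names here are made up; the behaviour follows SPARK-32160 as quoted above):

from pyspark import SparkContext, SparkConf

sc = SparkContext(conf=SparkConf().setAppName("driver-side"))   # fine: runs on the driver

def create_context_on_executor(_):
    # Executed inside a task: on Spark 3.1+ this raises
    # "SparkContext should only be created and accessed on the driver."
    # unless spark.executor.allowSparkContext is set to true.
    from pyspark import SparkContext
    SparkContext.getOrCreate()

# sc.parallelize(range(2)).foreach(create_context_on_executor)   # fails by default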
An error in the docs and/or implementation, I suggest.
The whole concept makes no sense if you (as you do) understand the Spark architecture, and no announcement has been made otherwise about this.
From the other answer and the plentiful documented errors on this aspect, it is clear something went awry.

Spark SQL - org.apache.spark.sql.AnalysisException

The error described below occurs when I run a Spark job on Databricks for the second time (less often the first).
The SQL query just performs a CREATE TABLE AS SELECT from a temp view registered from a DataFrame.
The first idea was to call spark.catalog.clearCache() at the end of the job (didn't help).
I also found a post on the Databricks forum about using object ... extends App (Scala) instead of a main method (didn't help either).
P.S. current_date() is a built-in function and it should be provided automatically (expected).
Spark 2.4.4, Scala 2.11, Databricks Runtime 6.2
org.apache.spark.sql.AnalysisException: Undefined function: 'current_date'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 21 pos 4
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1317)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:76)
Solution: ensure Spark is initialized every time the job is executed.
TL;DR:
I had a similar issue, and the object ... extends App observation pointed me in the right direction. In my case I was creating the Spark session outside of main but inside the object. When the job was executed the first time, the cluster/driver loaded the jar and initialised the spark variable; once the job finished successfully, the jar was kept in memory but the link to spark was lost for some reason, and any subsequent execution did not reinitialize spark because the jar was already loaded and my spark initialisation lived outside main, so it was not re-run. I think this is not an issue for Databricks jobs that create a cluster, or start one before execution (these resemble the first-time-start case); it only affects clusters that are already up and running, because jars are loaded either at cluster start-up or at job execution.
So I moved the Spark creation, i.e. SparkSession.builder()...getOrCreate(), into main, and now whenever the job is called the Spark session gets reinitialized.
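The original job is Scala, but the same idea in PySpark terms would look roughly like this (a sketch, not the actual Databricks job): build the session inside main so every invocation re-runs getOrCreate().

from pyspark.sql import SparkSession

def main():
    # Re-acquire (or create) the session on every job run instead of at
    # module/object initialisation time.
    spark = SparkSession.builder.appName("databricks-job").getOrCreate()   # example name
    # ... job logic ...
    spark.stop()

if __name__ == "__main__":
    main()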
current_date() is the built-in function and it should be provided
automatically (expected)
This expectation is wrong; you have to import the functions.
For Scala:
import org.apache.spark.sql.functions._
where the current_date function is available.
For PySpark:
from pyspark.sql import functions as F
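For example, a minimal PySpark snippet using the imported functions module (the app name is arbitrary):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("current-date-demo").getOrCreate()
spark.range(3).withColumn("today", F.current_date()).show()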

Why does SparkContext randomly close, and how do you restart it from Zeppelin?

I am working in Zeppelin writing spark-sql queries, and sometimes I suddenly start getting this error (without having changed any code):
Cannot call methods on a stopped SparkContext.
Then the output says further down:
The currently active SparkContext was created at:
(No active SparkContext.)
This obviously doesn't make sense. Is this a bug in Zeppelin? Or am I doing something wrong? How can I restart the SparkContext?
Thank you
I have faced this problem a couple of times.
If you are setting your master as yarn-client, it might be due to a stop/restart of the Resource Manager: the interpreter process may still be running, but the Spark context (which is a YARN application) no longer exists.
You could check whether the Spark context is still running by consulting your Resource Manager web interface and checking whether there is an application named Zeppelin running.
Sometimes restarting the interpreter process from within Zeppelin (interpreter tab --> spark --> restart) will solve the problem.
Other times you need to:
kill the Spark interpreter process from the command line
remove the Spark interpreter PID file
and the next time you start a paragraph it will start a new Spark context.
I'm facing the same problem running multiple jobs in PySpark. It seems that in Spark 2.0.0, with SparkSession, when I call spark.stop(), SparkSession runs roughly the following:
# SparkSession
self._sc.stop()
# SparkContext.stop()
self._jsc = None
Then, when I try to create a new job with a new SparkContext, SparkSession returns the same SparkContext as before, with self._jsc = None.
I solved it by setting SparkSession._instantiatedContext = None after spark.stop(), forcing SparkSession to create a new SparkContext the next time I ask for one.
It's not the best option, but meanwhile it solves my issue.
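A minimal sketch of that workaround (it relies on the private attribute SparkSession._instantiatedContext, so it is version-dependent and may break):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-1").getOrCreate()
# ... run the first job ...
spark.stop()

# Clear the cached context so the next getOrCreate() builds a fresh one.
SparkSession._instantiatedContext = None   # private API, Spark 2.0-era
spark = SparkSession.builder.appName("job-2").getOrCreate()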
I've noticed this issue more when running pyspark commands: even with trivial variable declarations, a cell execution hangs in the running state.
As mentioned above by user1314742, just killing the relevant PID solves this issue for me.
e.g.:
ps -ef | grep zeppelin
This is for when restarting the Spark interpreter and restarting the Zeppelin notebook does not solve the issue, I guess because Zeppelin cannot control the hung PID itself.
Could you check whether your driver memory is sufficient? I solved this issue by:
enlarging driver memory
tuning GC:
--conf spark.cleaner.periodicGC.interval=60
--conf spark.cleaner.referenceTracking.blocking=false
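A sketch of applying those settings programmatically (values copied from above; note that spark.driver.memory itself normally has to be set at launch time, e.g. via spark-submit or the Zeppelin interpreter settings, rather than after the JVM is up):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gc-tuning-example")                          # example name
         .config("spark.cleaner.periodicGC.interval", "60")
         .config("spark.cleaner.referenceTracking.blocking", "false")
         .getOrCreate())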
