Why I don't need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured? Or somebodyelse did it for me?
That is done only in the notebooks, to simplify user's work & avoiding them to specify different parameters, many of them won't have any effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark - both of them initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from jar or Python wheel as job, then it's your responsibility to create corresponding objects.
In Databricks environment, Whereas in Spark 2.0 the same effects can be achieved through SparkSession, without expliciting creating SparkConf, SparkContext or SQLContext, as they’re encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts.ref: link
Related
In https://docs.databricks.com/workflows/jobs/jobs.html#use-the-shared-sparkcontext it says:
Because Databricks initializes the SparkContext, programs that invoke
new SparkContext() will fail. To get the SparkContext, use only the
shared SparkContext created by Databricks:
val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()
In SparkContext#getOrCreate it says:
This function may be used to get or instantiate a SparkContext and
register it as a singleton object. Because we can only have one active
SparkContext per JVM, this is useful when applications may wish to
share a SparkContext. This method allows not passing a SparkConf
(useful if just retrieving).
In SparkSession.Builder#getOrCreate it says:
Gets an existing SparkSession or, if there is no existing one, creates
a new one based on the options set in this builder. This method first
checks whether there is a valid thread-local SparkSession, and if yes,
return that one. It then checks whether there is a valid global
default SparkSession, and if yes, return that one. If no valid global
default SparkSession exists, the method creates a new SparkSession and
assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the non-static config
options specified in this builder will be applied to the existing
SparkSession.
So my understanding is that Databricks somehow creates a SparkContext in some process, probably a JVM, and then executes the submitted JAR in a different JVM process. Is this understanding correct?
If it is, then how does the SparkContext sharing mechanism work across multiple processes?
If not, then what actually happens and how does SparkContext get shared?
Thanks
On Databricks, SparkContext/SparkSession are created when cluster is starting, and then you submitted jar is executed in the same JVM where SparkContext/SparkSession was created. The recommendations about not stopping SparkContext especially important when you are submitting job to the interactive cluster (not recommended for multiple reasons).
When you're using Python or R, you get separate Python/R processes, but they will use the same SparkContext.
Can anyone explain the difference between SparkContext, SQLContext, HiveContext and SparkSession EntryPoints and each one's usecases.
SparkContext is used for basic RDD API on both Spark1.x and Spark2.x
SparkSession is used for DataFrame API and Struct Streaming API on Spark2.x
SQLContext & HiveContext are used for DataFrame API on Spark1.x and deprecated from Spark2.x
Spark Context is Class in Spark API which is the first stage to build the spark application. Functionality of the spark context is to create memory in RAM we call this as driver memory, allocation of number of executers and cores in short its all about the cluster management. Spark Context can be used to create RDD and shared variables. SparkContext is a Class to access this we need to create object of it.
This way we can create Spark Context :: var sc=new SparkContext()
Spark Session this is new Object added since spark 2.x which is replacement of Sql Context and Hive Context.
Earlier we had two options like one is Sql Context which is way to do sql operation on Dataframe and second is Hive Context which manage the Hive connectivity related stuff and fetch/insert the data from/to the hive tables.
Since 2.x came We can create SparkSession for the SQL operation on Dataframe and if you have any Hive related work just call the Method enablehivesupport() then you can use the SparkSession for the both Dataframe and hive related SQL operations.
This way we can create SparkSession for Sql operation on Dataframe
val sparksession=SparkSession.builder().getOrCreate();
Second way is to create SparkSession for Sql operation on Dataframe as well as Hive Operation.
val sparkSession=SparkSession.builder().enableHiveSupport().getOrCreate()
What is the use of getOrCreate() method in SparkContext Class and how I can use it? I did not found any suitable example(coding wise) for this.
What I understand is that using above method I can share spark context between applications. What do we mean by applications here?
Is application a different job submitted to a spark cluster?
If so then we should be able to use global variables(broadcast) and temp tables registered in one application into another application ?
Please if anyone can elaborate and give suitable example on this.
As given in the Javadoc for SparkContext, getOrCreate() is useful when applications may wish to share a SparkContext. So yes, you can use it to share a SparkContext object across Applications. And yes, you can re-use broadcast variables and temp tables across.
As for understanding Spark Applications, please refer this link. In short, an application is the highest-level unit of computation in Spark. And what you submit to a spark cluster is not a job, but an application. Invoking an action inside a Spark application triggers the launch of a job to fulfill it.
getOrCreate
public SparkSession getOrCreate()
Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
This method first checks whether there is a valid thread-local SparkSession and if yes, return that one. It then checks whether there is a valid global default SparkSession and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.
Please check link: [https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html][1]
An example can be :
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
In Spark 1.6, a SparkEnv is automatically created after the creating a new SparkContext object.
In Spark 2.0, SparkSession was introduced as the entry point to Spark SQL.
Is SparkEnv created automatically after the creation of SparkSession in Spark 2?
Yes, SparkEnv, SparkConf and SparkContext are all automatically created when SparkSession is created (and that's why corresponding code in Spark SQL is more high-level and hopefully less error-prone).
SparkEnv is a part of Spark runtime infrastructure and is required to have all the Spark Core's low-level services up and running before you can use the high-level APIs in Spark SQL (or Spark MLlib). Nothing has changed here.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.sparkContext
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext#1e86506c
We see that,
Spark context available as 'sc'.
Spark session available as 'spark'.
I read spark session includes spark context, streaming context, hive context ... If so, then why are we not able to create an rdd by using a spark session instead of a spark context.
scala> val a = sc.textFile("Sample.txt")
17/02/17 16:16:14 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
a: org.apache.spark.rdd.RDD[String] = Sample.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val a = spark.textFile("Sample.txt")
<console>:23: error: value textFile is not a member of org.apache.spark.sql.SparkSession
val a = spark.textFile("Sample.txt")
As shown above, sc.textFile succeeds in creating an RDD but not spark.textFile.
In Spark 2+, Spark Context is available via Spark Session, so all you need to do is:
spark.sparkContext().textFile(yourFileOrURL)
see the documentation on this access method here.
Note that in PySpark this would become:
spark.sparkContext.textFile(yourFileOrURL)
see the documentation here.
In earlier versions of spark, spark context was entry point for Spark. As RDD was main API, it was created and manipulated using context API’s.
For every other API,we needed to use different contexts.For streaming, we needed StreamingContext, for SQL sqlContext and for hive HiveContext.
But as DataSet and Dataframe API’s are becoming new standard API’s Spark need an entry point build for them. So in Spark 2.0, Spark have a new entry point for DataSet and Dataframe API’s called as Spark Session.
SparkSession is essentially combination of SQLContext, HiveContext and future StreamingContext.
All the API’s available on those contexts are available on spark session also. Spark session internally has a spark context for actual computation.
sparkContext still contains the method which it had in previous
version .
methods of sparkSession can be found here
It can be created in the following way-
val a = spark.read.text("wc.txt")
This will create a dataframe,If you want to convert it to RDD then use-
a.rdd Please refer the link below,on dataset API-
http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Dataset.html