Build a SparkSession - apache-spark

I use Spark as an interpreter in Zeppelin.
I'm using Spark 2.0, and I built a session myself.

In general, you should not initialize SparkSession or SparkContext in Zeppelin. Zeppelin notebooks are configured to create a session for you, and their correct behavior depends on using the provided objects.
Initializing your own SparkSession will break core Zeppelin functionality, and in the worst case multiple SparkContexts will break things completely.
Is setting spark.driver.allowMultipleContexts to false the best approach for tests?
You should never use spark.driver.allowMultipleContexts. It is not supported and doesn't guarantee correct results.

Related

Why don't I need to create a SparkSession in Databricks?

Why don't I need to create a SparkSession in Databricks? Is a SparkSession created automatically when the cluster is configured? Or did somebody else do it for me?
That is done only in notebooks, to simplify the user's work and to spare them from specifying parameters, many of which would have no effect because Spark is already started. This behavior is similar to what you get when you start spark-shell or pyspark, both of which initialize the SparkSession and SparkContext:
Spark context available as 'sc' (master = local[*], app id = local-1635579272032).
SparkSession available as 'spark'.
But if you're running code from a JAR or a Python wheel as a job, then it's your responsibility to create the corresponding objects.
In the Databricks environment, as of Spark 2.0, the same effects can be achieved through SparkSession alone, without explicitly creating SparkConf, SparkContext or SQLContext, as they’re encapsulated within the SparkSession. Using a builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts. (ref: link)
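A minimal PySpark sketch of that split between notebook and job code. The names create_session and run and the app name "example-job" are hypothetical, chosen only for illustration. The import is deferred into create_session so that the job logic can be exercised without Spark installed.

```python
def create_session(app_name="example-job"):  # hypothetical app name
    """Build a session for a standalone job (JAR/wheel); not needed in a notebook."""
    # Imported lazily so that run() can be tested without PySpark installed.
    from pyspark.sql import SparkSession

    # getOrCreate() returns the existing session if there is one,
    # otherwise it creates a new session and its underlying SparkContext.
    return SparkSession.builder.appName(app_name).getOrCreate()


def run(spark):
    """Job logic: receives a session instead of creating one."""
    return spark.range(10).count()


# As a job:       spark = create_session(); print(run(spark)); spark.stop()
# In a notebook:  print(run(spark))   # `spark` is the pre-created session
```

Passing the session in as an argument means the same run function works both where the environment provides `spark` and where the job has to build it.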

Use case of spark.executor.allowSparkContext

While looking into spark-core, I found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find any details in the official Spark documentation.
In the code, there is a short description of this config:
If set to true, SparkContext can be created in executors.
But I wonder: how can a SparkContext be created in executors? As far as I know, SparkContext is created on the driver, and executors are assigned by the resource manager. So SparkContext is always created before the executors.
What is the use case of this config?
From the Spark Core migration guide, 3.0 to 3.1:
In Spark 3.0 and below, SparkContext can be created in executors.
Since Spark 3.1, an exception will be thrown when creating
SparkContext in executors. You can allow it by setting the
configuration spark.executor.allowSparkContext when creating
SparkContext in executors.
As per issue SPARK-32160, since version 3.1 a check has been added when creating SparkContext (for PySpark, see pyspark/context.py) that prevents executors from creating a SparkContext:
if (conf is None or
        conf.get("spark.executor.allowSparkContext", "false").lower() != "true"):
    # In order to prevent SparkContext from being created in executors.
    SparkContext._assert_on_driver()
# ...
@staticmethod
def _assert_on_driver():
    """
    Called to ensure that SparkContext is created only on the Driver.
    Throws an exception if a SparkContext is about to be created in executors.
    """
    if TaskContext.get() is not None:
        raise Exception("SparkContext should only be created and accessed on the driver.")
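To see which settings switch the check off, the condition from the excerpt can be exercised on its own. This is a toy sketch: a plain dict stands in for SparkConf (both offer get(key, default)), and guard_is_active is a hypothetical helper, not part of PySpark.

```python
KEY = "spark.executor.allowSparkContext"


def guard_is_active(conf):
    """Return True when the driver-only assertion would be run.

    `conf` is None or a dict standing in for SparkConf (an assumption made
    for illustration); the boolean expression mirrors the excerpt above.
    """
    return conf is None or conf.get(KEY, "false").lower() != "true"


print(guard_is_active(None))           # no conf at all
print(guard_is_active({}))             # key not set, default "false"
print(guard_is_active({KEY: "TRUE"}))  # comparison is case-insensitive
```

Only an explicit "true" (in any letter case) skips the assertion; a missing conf or a missing key leaves it in place.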
I suspect an error in the docs and/or the implementation.
The whole concept makes no sense if you understand the Spark architecture (as you do). No announcement to the contrary has been made about this.
From the other answer and the many error reports on this point, it is clear that something went awry.

What is the difference between SparkSession.config() and spark.conf.set()?

I tried both ways to set spark.dynamicAllocation.minExecutors, but it seems that only the first one works:
spark2 = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.dynamicAllocation.minExecutors", 15) \
    .getOrCreate()
vs.
spark2.conf.set("spark.dynamicAllocation.minExecutors", 15)
It is not so much about the difference between the methods, as the difference in the context in which these are executed.
pyspark.sql.session.SparkSession.Builder options can be set before the Spark application has started. This means that, if there is no active SparkSession to be retrieved, some cluster-specific options can still be set.
If the session was already initialized, setting new config options might not work. See, for example, Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI
pyspark.sql.conf.RuntimeConfig can be retrieved only from an existing session, so its set method is called once the cluster is running. At this point the majority of cluster-specific options are frozen and cannot be modified.
In general, RuntimeConfig.set is used to modify spark.sql.* configuration parameters, which normally can be changed at runtime.
Please note that, depending on the deployment mode, some options (most notably spark.*.extraJavaOptions) cannot be set using either of these methods, and can be modified only through spark-submit arguments or configuration files.
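A rough PySpark sketch of how the two mechanisms might be combined: options that shape the application go through the builder before the session exists, and spark.sql.* options go through the runtime config afterwards. The helper names split_options and build_session are hypothetical, and the prefix rule is only a heuristic, since not every spark.sql.* option is modifiable at runtime.

```python
def split_options(options):
    """Split options into (startup, runtime) by a simple prefix heuristic."""
    runtime = {k: v for k, v in options.items() if k.startswith("spark.sql.")}
    startup = {k: v for k, v in options.items() if k not in runtime}
    return startup, runtime


def build_session(app_name, options):
    """Apply startup options on the builder, runtime options on the live session."""
    # Imported lazily so split_options() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    startup, runtime = split_options(options)
    builder = SparkSession.builder.appName(app_name)
    for key, value in startup.items():
        builder = builder.config(key, value)  # must happen before getOrCreate()
    spark = builder.getOrCreate()
    for key, value in runtime.items():
        spark.conf.set(key, value)  # RuntimeConfig.set on the existing session
    return spark
```

A call might look like build_session("test", {"spark.dynamicAllocation.minExecutors": "15", "spark.sql.shuffle.partitions": "64"}). If a session is already running, getOrCreate() returns it, and the startup options may not take effect.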
I think what you really wanted to ask is why certain configurations (e.g. spark.dynamicAllocation.minExecutors) cannot be set using spark2.conf.set but can be set using SparkSession.config?
spark.dynamicAllocation.minExecutors controls how Spark jobs are executed, most importantly the number of executors, and as such should not be set within a Spark application. I'm even surprised to hear that it worked at all. It should not really, IMHO.
The reason why this and some other configurations should not be set within a Spark application is that they control the execution environment for the underlying Spark runtime (which works behind the scenes of Spark SQL) and as such should be changed using spark-submit, which is more for application deployers or admins than for developers themselves. Whether dynamic allocation (of executors) is used or not has no impact on the business use of Spark and is a decision to be made after the application is developed.
With that said, let me answer your question directly: some configurations need to be set before a SparkSession instance is created, as they control how this instance is going to be instantiated. Once you have created the instance, by the time you call spark2.conf the instance is already configured and some configurations can never be changed. It seems that spark.dynamicAllocation.minExecutors is among the configurations that cannot be changed after an instance of SparkSession has been created. And given what I said earlier, I'm happy to hear that this is the case (but unfortunately not in all cases).
Some config properties need to be set before the SparkSession starts for them to work. SparkSession uses them at the time of initialization. If you set spark.dynamicAllocation.minExecutors after the SparkSession is created, the value of that property in the SparkConf object still changes, and you can verify that by printing the property, but it does not affect the session, which took the value present at the time of initialization.

SnappyData - snappy-job - cannot run jar file

I'm trying to run a jar file from the SnappyData CLI.
I just want to create a SparkSession and a SnappyData session at the beginning.
package io.test
import org.apache.spark.sql.{SnappySession, SparkSession}
object snappyTest {
  def main(args: Array[String]) {
    val spark: SparkSession = SparkSession
      .builder
      .appName("SparkApp")
      .master("local")
      .getOrCreate
    val snappy = new SnappySession(spark.sparkContext)
  }
}
From sbt file:
name := "SnappyPoc"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.0.0"
When I debug the code in the IDE, it works fine, but when I create a jar file and try to run it directly on Snappy I get this message:
"message": "Ask timed out on [Actor[akka://SnappyLeadJobServer/user/context-supervisor/snappyContext1508488669865777900#1900831413]] after [10000 ms]",
"errorClass": "akka.pattern.AskTimeoutException",
I have Spark Standalone 2.1.1 and SnappyData 1.0.0.
I added the dependencies to the Spark instance.
Could you help me? Thanks in advance.
The difference between "embedded" mode and "smart connector" mode needs to be explained first.
Normally when you run a job using spark-submit, then it spawns a set of new executor JVMs as per configuration to run the code. However in the embedded mode of SnappyData, the nodes hosting the data also host long-running Spark Executors themselves. This is done to minimize data movement (i.e. move execution rather than data). For that mode you can submit a job (using snappy-job.sh) which will run the code on those pre-existing executors. Alternative routes include the JDBC/ODBC for embedded execution. This also means that you cannot (yet) use spark-submit to run embedded jobs because that will spawn its own JVMs.
The "smart connector" mode is the normal way in which other Spark connectors work, but like all of those it has the disadvantage of having to pull the required data into the executor JVMs and thus will be slower than embedded mode. To configure it, one has to specify the "snappydata.connection" property to point to the thrift server running on the SnappyData cluster's locator. It is useful for many cases where users want to expand the execution capacity of the cluster (e.g. if the cluster's embedded execution is saturated on CPU all the time), or for existing Spark distributions/deployments. Needless to say, spark-submit works in the connector mode just fine. What is "smart" about this mode is: a) if the physical nodes hosting the data and running the executors are common, then partitions will be routed to those executors as much as possible to minimize network usage, b) it uses the optimized SnappyData plans for table scans, hash aggregation, and hash joins.
For this specific question, the answer is: runSnappyJob will receive the SnappySession object as an argument, which should be used rather than creating one. The rest of the body that uses SnappySession will be exactly the same. Likewise, for working with the base SparkContext, it might be easier to implement SparkJob; the code will be similar except that the SparkContext will be provided as a function argument, which should be used. The reason is as explained above: embedded mode already has a running SparkContext, which needs to be used for jobs.
I think the methods isValidJob and runSnappyJob were missing.
When I added them to the code it works, but does anyone know what the relation is between the body of the runSnappyJob method and the main method?
Should it be the same in both?

In Apache Spark SQL, How to close metastore connection from HiveContext

My project has unit tests for different HiveContext configurations (sometimes they are in one file as they are grouped by features.)
After upgrading to Spark 1.4 I encounter a lot of 'java.sql.SQLException: Another instance of Derby may have already booted the database' problems, as a patch made those contexts unable to share the same metastore. Since it's not clean to revert the state of a singleton for every test, my only option boils down to "recycling" each context by terminating the previous Derby metastore connection. Is there a way to do this?
Well, in Scala I just used FunSuite for unit tests together with the BeforeAndAfterAll trait. Then you can just init your sparkContext in beforeAll, spawn your HiveContext from it, and finish it like this:
override def afterAll(): Unit = {
  if (sparkContext != null)
    sparkContext.stop()
}
From what I've noticed it also closes a HiveContext attached to it.

Resources