what is the difference between sparksession.config() and spark.conf.set() - apache-spark

I tried use both ways to set spark.dynamicAllocation.minExecutors, but it seems like that only the first way works
spark2 = SparkSession \
.builder \
.appName("test") \
.config("spark.dynamicAllocation.minExecutors", 15) \
.getOrCreate()
vs.
spark2.conf.set("spark.dynamicAllocation.minExecutors", 15)

It is not so much about the difference between the methods, as the difference in the context in which these are executed.
pyspark.sql.session.SparkSession.Builder options can be executed before Spark application has been started. This means that, if there is no active SparkSession to be retrieved, some cluster specific options can be still set.
If the session was already initialized setting new config options might not work. See for example Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI
pyspark.sql.conf.RuntimeConfig can be retrieved only from exiting session, therefore its set method is called once the cluster is running. At this point majority of cluster specific options are frozen and cannot be modified.
In general RuntimeConfig.set is used to modify spark.sql.* configuration parameters, which normally can be changed on runtime.
Please note, that depending on the deployment mode, some options (most notably spark.*.extraJavaOptions) cannot be set using any of these methods, and can be modified only through spark-submit arguments or using configuration files.

I think you'd rather wanted to ask why certain configurations (e.g. spark.dynamicAllocation.minExecutors) cannot be set using spark2.conf.set vs SparkSession.config?
spark.dynamicAllocation.minExecutors is to control how to execute Spark jobs, most importantly to control the number of executors and as such should not be set within a Spark application. I'm even surprised to hear that it worked at all. It should not really IMHO.
The reason why this and some other configurations should not be set within a Spark application is that they control the execution environment for the underlying Spark runtime (that worked behind the scenes of Spark SQL) and as such should be changed using spark-submit that is more for application deployers or admins than developers themselves. Whether dynamic allocation (of executors) is used or not has no impact on the business use of Spark and is a decision to be made after the application is developed.
With that said, let me answer your question directly, some configurations require to be set before a SparkSession instance is created as they control how this instance is going to be instantiated. Once you created the instance, when you call spark2.conf the instance is already configured and some configurations cannot be changed ever. It seems that spark.dynamicAllocation.minExecutors is among the configurations that cannot be changed after an instance of SparkSession has been created. And given what I said earlier I'm happy to hear that this is the case (but unfortunately not in all cases).

Some config properties need to be set before the SparkSession starts for them to work. Sparksession uses them at the time of initialization. If u set spark.dynamicAllocation.minExecutors after the creation of sparksession there will still be a change in the value for that property in sparConf object and u can verify that by printing the property but it does not affect the sparksession session as it took the value present at the time of the initialization.

Related

Get SparkSession in partition loop [duplicate]

I'm writing Spark Jobs that talk to Cassandra in Datastax.
Sometimes when working through a sequence of steps in a Spark job, it is easier to just get a new RDD rather than join to the old one.
You can do this by calling the SparkContext [getOrCreate][1] method.
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network.
In this case - you're registering a singleton for that JVM, and so it gets around the problem of serialization.
One day my tech lead came to me and said
Don't use SparkContext getOrCreate you can and should use joins instead
But he didn't give a reason.
My question is: Is there a reason not to use SparkContext.getOrCreate when writing a spark job?
TL;DR There are many legitimate applications of the getOrCreate methods but attempt to find a loophole to perform map-side joins is not one of them.
In general there is nothing deeply wrong with SparkContext.getOrCreate. The method has its applications, and although there some caveats, most notably:
In its simplest form it doesn't allow you to set job specific properties, and the second variant ((SparkConf) => SparkContext) requires passing SparkConf around, which is hardly an improvement over keeping SparkContext / SparkSession in the scope.
It can lead to opaque code with "magic" dependency. It affects testing strategies and overall code readability.
However your question, specifically:
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network
and
Don't use SparkContext getOrCreate you can and should use joins instead
suggests you're actually using the method in a way that it was never intended to be used. By using SparkContext on an executor node.
val rdd: RDD[_] = ???
rdd.map(_ => {
val sc = SparkContext.getOrCreate()
...
})
This is definitely something that you shouldn't do.
Each Spark application should have one, and only one SparkContext initialized on the driver, and Apache Spark developers made at a lot prevent users from any attempts of using SparkContex outside the driver. It is not because SparkContext is large, or impossible to serialize, but because it is fundamental feature of the Spark's computing model.
As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:
Describes processing pipeline in a way that can be translated into actual task.
Enables graceful recovery in case of task failures.
Allows proper resource allocation and ensures lack of cyclic dependencies.
Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext cyclic dependencies are not an issue - RDDs and Datasets exist only in a scope of its parent context so you won't be able to objects belonging to the application driver.
Proper resource allocation is a different thing. Since each SparkContext creates its own Spark application, your "main" process won't be able to account for resources used by the contexts initialized in the tasks. At the same time cluster manager won't have any indication that application or somehow interconnected. This is likely to cause deadlock-like conditions.
It is technically possible to go around it, with careful resource allocation and usage of the manager-level scheduling pools, or even a separate cluster manager with its own set or resources, but it is not something that Spark is designed for, it not supported, and overall would lead to brittle and convoluted design, where correctness depends on a configuration details, specific cluster manager choice and overall cluster utilization.

Build a SparkSession

I have a spark as interpreter in Zeppelin.
I'm using a Spark2.0, I built a Session: Create
In general you should not initialize SparkSession nor SparkContext in Zeppelin. Zeppelin notebooks are configured to create session for you, and their correct behavior depends on using provided objects.
Initializing your SparkSession will break core Zeppelin functionalities, and multiple SparkContexts will break things completely in the worst case scenario.
Is set spark.driver.allowMultipleContexts to False is best to do a tests ?
You should never use spark.driver.allowMultipleContexts - it not supported, and doesn't guarantee correct results.

Spark job throwing NPE [duplicate]

I'm writing Spark Jobs that talk to Cassandra in Datastax.
Sometimes when working through a sequence of steps in a Spark job, it is easier to just get a new RDD rather than join to the old one.
You can do this by calling the SparkContext [getOrCreate][1] method.
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network.
In this case - you're registering a singleton for that JVM, and so it gets around the problem of serialization.
One day my tech lead came to me and said
Don't use SparkContext getOrCreate you can and should use joins instead
But he didn't give a reason.
My question is: Is there a reason not to use SparkContext.getOrCreate when writing a spark job?
TL;DR There are many legitimate applications of the getOrCreate methods but attempt to find a loophole to perform map-side joins is not one of them.
In general there is nothing deeply wrong with SparkContext.getOrCreate. The method has its applications, and although there some caveats, most notably:
In its simplest form it doesn't allow you to set job specific properties, and the second variant ((SparkConf) => SparkContext) requires passing SparkConf around, which is hardly an improvement over keeping SparkContext / SparkSession in the scope.
It can lead to opaque code with "magic" dependency. It affects testing strategies and overall code readability.
However your question, specifically:
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network
and
Don't use SparkContext getOrCreate you can and should use joins instead
suggests you're actually using the method in a way that it was never intended to be used. By using SparkContext on an executor node.
val rdd: RDD[_] = ???
rdd.map(_ => {
val sc = SparkContext.getOrCreate()
...
})
This is definitely something that you shouldn't do.
Each Spark application should have one, and only one SparkContext initialized on the driver, and Apache Spark developers made at a lot prevent users from any attempts of using SparkContex outside the driver. It is not because SparkContext is large, or impossible to serialize, but because it is fundamental feature of the Spark's computing model.
As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:
Describes processing pipeline in a way that can be translated into actual task.
Enables graceful recovery in case of task failures.
Allows proper resource allocation and ensures lack of cyclic dependencies.
Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext cyclic dependencies are not an issue - RDDs and Datasets exist only in a scope of its parent context so you won't be able to objects belonging to the application driver.
Proper resource allocation is a different thing. Since each SparkContext creates its own Spark application, your "main" process won't be able to account for resources used by the contexts initialized in the tasks. At the same time cluster manager won't have any indication that application or somehow interconnected. This is likely to cause deadlock-like conditions.
It is technically possible to go around it, with careful resource allocation and usage of the manager-level scheduling pools, or even a separate cluster manager with its own set or resources, but it is not something that Spark is designed for, it not supported, and overall would lead to brittle and convoluted design, where correctness depends on a configuration details, specific cluster manager choice and overall cluster utilization.

SnappyData - snappy-job - cannot run jar file

I'm trying run jar file from snappydata cli.
I'm just want to create a sparkSession and SnappyData session on beginning.
package io.test
import org.apache.spark.sql.{SnappySession, SparkSession}
object snappyTest {
def main(args: Array[String]) {
val spark: SparkSession = SparkSession
.builder
.appName("SparkApp")
.master("local")
.getOrCreate
val snappy = new SnappySession(spark.sparkContext)
}
}
From sbt file:
name := "SnappyPoc"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.0.0"
When I'm debuging code in IDE, it works fine, but when I create a jar file and try to run it directly on snappy I get message:
"message": "Ask timed out on [Actor[akka://SnappyLeadJobServer/user/context-supervisor/snappyContext1508488669865777900#1900831413]] after [10000 ms]",
"errorClass": "akka.pattern.AskTimeoutException",
I have Spark Standalone 2.1.1, SnappyData 1.0.0.
I added dependencies to Spark instance.
Could you help me ?. Thank in advanced.
The difference between "embedded" mode and "smart connector" mode needs to be explained first.
Normally when you run a job using spark-submit, then it spawns a set of new executor JVMs as per configuration to run the code. However in the embedded mode of SnappyData, the nodes hosting the data also host long-running Spark Executors themselves. This is done to minimize data movement (i.e. move execution rather than data). For that mode you can submit a job (using snappy-job.sh) which will run the code on those pre-existing executors. Alternative routes include the JDBC/ODBC for embedded execution. This also means that you cannot (yet) use spark-submit to run embedded jobs because that will spawn its own JVMs.
The "smart connector" mode is the normal way in which other Spark connectors work but like all those has the disadvantage of having to pull the required data into the executor JVMs and thus will be slower than embedded mode. For configuring the same, one has to specify "snappydata.connection" property to point to the thrift server running on SnappyData cluster's locator. It is useful for many cases where users want to expand the execution capacity of cluster (e.g. if cluster's embedded execution is saturated all the time on CPU), or for existing Spark distributions/deployments. Needless to say that spark-submit can work in the connector mode just fine. What is "smart" about this mode is: a) if physical nodes hosting the data and running executors are common, then partitions will be routed to those executors as much as possible to minimize network usage, b) will use the optimized SnappyData plans to scan the tables, hash aggregation, hash join.
For this specific question, the answer is: runSnappyJob will receive the SnappySession object as argument which should be used rather than creating it. Rest of the body that uses SnappySession will be exactly same. Likewise for working with base SparkContext, it might be easier to implement SparkJob and code will be similar except that SparkContext will be provided as function argument which should be used. The reason being as explained above: embedded mode already has a running SparkContext which needs to be used for jobs.
I think there were missing methods isValidJob and runSnappyJob.
When I added those to code it works, but know someone what is releation beetwen body of metod runSnappyJob and method main
Should be the same in both ?

Spark configuration priority

Does there any difference or priority between specifying spark application configuration in the code :
SparkConf().setMaster(yarn)
and specifying them in command line
spark-submit --master yarn
Yes, the highest priority is given to the configuration in the user's code with the set() function. After that there the flags passed with spark-submit.
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source
There are 4 precedence level: (1 to 4 , 1 being the highest priority):
SparkConf set in the application
Properties given with the spark-submit
Properties can be given in a property file. And the
property file can be given as argument while submission
Default values
Other than the priority, specifying it on a command-line would allow you to run on different cluster managers without modifying code. The same application can be run on local[n] or yarn or mesos or spark standalone cluster.

Resources