I don't understand how the spark session/context lifecycle works. The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed? For example, if I have a production cluster and I spark-submit 10 ETLs, will these 10 jobs share the same SparkContext? Does it matter if I do this in cluster/client mode?
To the best of my understanding, the SparkContext lives in the driver so I assume the above would result in one SparkContext shared by 10 SparkSessions, but I'm not at all sure I got this correctly... Any clarification will be much appreciated.
Let's understand SparkSession and SparkContext
SparkContext is the channel through which all Spark functionality is accessed. The Spark driver program uses it to connect to and communicate with the cluster manager, to submit Spark jobs, and to know which resource manager (e.g. YARN) to talk to. Through SparkContext, the driver can also access other contexts such as SQLContext, HiveContext, and StreamingContext to program Spark.
With Spark 2.0, SparkSession provides a single, unified point of entry to all of the aforementioned Spark functionality. In other words, SparkSession encapsulates SparkContext.
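As a rough illustration (a minimal sketch, not from the answer above; the application name and local[*] master are placeholder values for demonstration), the underlying SparkContext is reachable directly from a SparkSession:

import org.apache.spark.sql.SparkSession

// Build (or reuse) the single session for this application.
val spark = SparkSession.builder()
  .appName("encapsulation-demo")
  .master("local[*]")
  .getOrCreate()

// The session wraps the context; there is no separate SparkContext construction.
val sc = spark.sparkContext
println(sc.appName) // same application name, reported by the underlying context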
Let's say we have multiple users accessing the same notebook, which shares a single SparkContext, and the requirement is to give each of them an isolated environment on top of that shared context. Prior to 2.0, the solution was to create multiple SparkContexts, i.e. one SparkContext per isolated environment or user, which is an expensive operation (and only a single SparkContext can exist per JVM). With the introduction of SparkSession, this issue has been addressed.
If I spark-submit 10 ETLs, will these 10 jobs share the same SparkContext? Does it matter if I do this in cluster/client mode? To the best of my understanding, the SparkContext lives in the driver, so I assume the above would result in one SparkContext shared by 10 SparkSessions.
If you spark-submit 10 ETL jobs, whether in cluster or client mode, they are all different applications and each has its own SparkContext and SparkSession. In native Spark you can't share objects between different applications; if you want to share objects you have to use a shared context (spark-jobserver). Multiple options are available, such as Apache Ivy and Apache Ignite.
There is one SparkContext per spark application.
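To illustrate (a hedged sketch, not taken from the original answer; the names and the local[*] master are illustrative only): even if you call SparkContext.getOrCreate more than once inside one application, you get back the same context rather than a second one.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("single-context-demo").setMaster("local[*]")

val sc1 = SparkContext.getOrCreate(conf)
val sc2 = SparkContext.getOrCreate() // returns the context that is already running

// Both references point at the one SparkContext of this application.
assert(sc1 eq sc2)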
The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed?
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// master and appName are provided elsewhere in the application
private val conf: SparkConf = new SparkConf()
  .setMaster(master)
  .setAppName(appName)
  .set("spark.ui.enabled", "false")

val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
If you have an existing spark session and want to create new one, use the newSession method on the existing SparkSession.
import org.apache.spark.sql.SparkSession

// spark below refers to the existing SparkSession (ss above)
val newSparkSession1 = spark.newSession()
val newSparkSession2 = spark.newSession()
The newSession method creates a new Spark session with isolated SQL configurations and temporary tables. The new session shares the underlying SparkContext and cached data.
Then you can use these different sessions to submit different jobs/sql queries.
newSparkSession1.sql("<ETL-1>")
newSparkSession2.sql("<ETL-2>")
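For example (a small sketch, not from the original answer): temporary views registered in one of these sessions are not visible in the other, while the SparkContext is shared.

// Each session keeps its own catalog of temporary views...
newSparkSession1.range(10).createOrReplaceTempView("numbers")
newSparkSession1.sql("SELECT count(*) FROM numbers").show() // works

// ...so the same view name does not exist in the second session:
// newSparkSession2.sql("SELECT count(*) FROM numbers")      // would fail: table or view not found

// Both sessions are still backed by the one SparkContext of the application.
assert(newSparkSession1.sparkContext eq newSparkSession2.sparkContext)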
Does it matter if I do this in cluster/client mode?
Client/Cluster mode doesn't matter.
Related
What's the purpose of SparkContext and SparkConf? I'm looking for the detailed difference.
More than the definition below:
SparkContext was the entry point of any Spark application, used to access all Spark features; it needed a SparkConf, which held all the cluster configurations and parameters, to create a SparkContext object.
The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager. The resource manager can be YARN or Spark's standalone cluster manager. In order to create a SparkContext you should first create a SparkConf. The SparkConf stores the configuration parameters that your Spark driver application will pass to the SparkContext. Some of these parameters define properties of your Spark driver application, and some are used by Spark to allocate resources on the cluster, such as the number, memory size and cores of the executors running on the worker nodes. setAppName() gives your Spark driver application a name so you can identify it in the Spark or YARN UI.
SparkConf is passed into SparkContext so our driver application knows how to access the cluster.
Now that your Spark driver application has a SparkContext, it knows which resource manager to use and can ask it for resources on the cluster. If you are using YARN, Hadoop's ResourceManager (head node) and NodeManagers (worker nodes) will work together to allocate containers for the executors. If the resources are available on the cluster, the executors will allocate memory and cores based on your configuration parameters. If you are using Spark's standalone cluster manager, the Spark master (head node) and Spark workers (worker nodes) will be used to allocate the executors.
Each Spark driver application has its own executors on the cluster which remain running as long as the Spark driver application has a SparkContext. The executors run user code, run computations and can cache data for your application. The SparkContext will create a job that is broken into stages. The stages are broken into tasks which are scheduled by the SparkContext on an executor.
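Putting that together, a minimal (pre-2.0 style) sketch; the resource values are chosen purely for illustration, and setMaster is shown only so the sketch can run locally, since with spark-submit the master normally comes from the command line:

import org.apache.spark.{SparkConf, SparkContext}

// SparkConf holds the parameters the driver passes to the cluster manager.
val conf = new SparkConf()
  .setAppName("conf-vs-context-demo") // name shown in the Spark/YARN UI
  .setMaster("local[*]")
  .set("spark.executor.memory", "2g") // illustrative resource settings
  .set("spark.executor.cores", "2")

// The SparkContext uses that configuration to request executors from the resource manager.
val sc = new SparkContext(conf)

// An action triggers a job, which Spark breaks into stages and tasks
// that are scheduled on the executors.
val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
println(total)

sc.stop()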
I have around 10 Spark jobs where each would do some transformations and load data into a database. The Spark session has to be opened and closed individually for each job, and the initialization consumes time every time.
Is it possible to create the Spark session only once and re-use the same across multiple jobs ?
Technically, if you use a single SparkSession you will end up with a single Spark application, because you will have to package and run the multiple ETL (Extract, Transform & Load) steps within a single JAR file.
If you are running those jobs on a production cluster, most likely you are using spark-submit to execute your application JAR, which will have to go through the initialization phase every time you submit a job through the Spark master to the workers in client mode.
In general, a long-running Spark session is mostly suitable for prototyping, troubleshooting and debugging. For example, a single Spark session can be leveraged in spark-shell or any other interactive development environment, like Zeppelin, but not with spark-submit, as far as I know.
All in all, a couple of design/business questions are worth considering here: will merging multiple ETL jobs together produce code that is easy to sustain, manage and debug? Does it provide the required performance gain? What does the risk/cost analysis look like? And so on.
Hope this helps.
You can submit your job once, in other words do spark-submit once. Inside the code that is submitted you can have 10 calls, each doing some transformation and loading data into the database.
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder
  .appName("Multiple-jobs")
  .master("<cluster name>")
  .getOrCreate()

def method1(): Unit = {
  // getOrCreate returns the same Spark session created outside the method.
  val spark = SparkSession.builder.getOrCreate()
  // work for ETL #1
}

def method2(): Unit = {
  // work for ETL #2, again reusing the same session via getOrCreate
  val spark = SparkSession.builder.getOrCreate()
}

method1()
method2()
However, if a job is time-consuming, say it takes 10 minutes, then in comparison you wouldn't be spending a lot of time creating separate Spark sessions, so I wouldn't worry about one Spark session per job. I would be worried, though, if a separate Spark session were created per method or per unit test case; that is where I would save on Spark sessions.
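To sketch what I mean (a hypothetical helper, not taken from the question; the object name and local[*] master are assumptions): share one lazily created session across methods or test cases instead of building a new one each time.

import org.apache.spark.sql.SparkSession

// Hypothetical helper: one SparkSession for the whole run.
object SharedSparkSession {
  lazy val spark: SparkSession = SparkSession.builder
    .appName("shared-test-session")
    .master("local[*]")
    .getOrCreate()
}

// Each method or test case reuses it; builder.getOrCreate() elsewhere in the code
// would also return this same instance once it exists.
def testTransformationA(): Unit = {
  val spark = SharedSparkSession.spark
  assert(spark.range(5).count() == 5)
}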
I have a query regarding creating multiple Spark sessions in one JVM. SPARK-2243 says that creating multiple SparkContexts is not supported. Is this true for SparkSession in Spark 2.0 as well? I have also read about the cloneSession method in the SparkSession API, which allows creating an identical copy of a SparkSession and sharing the underlying SparkContext between the two sessions. Does this mean that there can be multiple SparkSessions in one JVM?
Having two separate PySpark applications that instantiate a HiveContext in place of a SQLContext causes one of the two applications to fail with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is enough to reproduce:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext  # HiveContext is what the script uses; SQLContext is the drop-in alternative mentioned below
conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)
data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows when reading the parquet-file. When I replace HiveContext with SQLContext everything's fine.
Does anyone know why that is?
By default Hive(Context) uses an embedded Derby database as its metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications you should configure a standalone metastore. At the moment Hive supports PostgreSQL, MySQL, Oracle and MS SQL Server as backends. Details of the configuration depend on the backend and the option (local / remote), but generally speaking you'll need:
a running RDBMS server
a metastore database created using provided scripts
a proper Hive configuration
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
Theoretically it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.
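For completeness, a minimal Scala sketch of pointing an application at a standalone (remote) metastore instead of the embedded Derby one. It assumes Spark 2.x built with Hive support and a metastore service already running at the placeholder address thrift://metastore-host:9083:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shared-metastore-demo")
  // hive.metastore.uris points Spark's Hive client at the standalone metastore service;
  // the address is a placeholder for your environment.
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

// Applications configured this way share one metastore, so concurrent applications
// no longer compete for an embedded Derby database.
spark.sql("SHOW DATABASES").show()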
For development you can start the applications in different working directories. This will create a separate metastore_db for each application and avoid the issue of multiple active users. Providing a separate Hive configuration should work as well, but it is less useful in development:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory