I have a query regarding creating multiple Spark sessions in one JVM. SPARK-2243 says that creating multiple SparkContexts is not supported. Is this true for SparkSession in Spark 2.0 as well? I also read about the cloneSession method in the SparkSession API, which allows creating an identical copy of a SparkSession that shares the underlying SparkContext with the original session. Does this mean that there can be multiple SparkSessions in one JVM?
Related
I don't understand how the spark session/context lifecycle works. The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed? For example, if I have a production cluster and I spark-submit 10 ETLs, will these 10 jobs share the same SparkContext? Does it matter if I do this in cluster/client mode?
To the best of my understanding, the SparkContext lives in the driver so I assume the above would result in one SparkContext shared by 10 SparkSessions, but I'm not at all sure I got this correctly... Any clarification will be much appreciated.
Let's understand SparkSession and SparkContext.
SparkContext is a channel to access all Spark functionality. The Spark driver program uses it to connect to the cluster manager, to communicate with it and submit Spark jobs, and to know which resource manager (e.g. YARN) to talk to. Through SparkContext, the driver can also access other contexts such as SQLContext, HiveContext, and StreamingContext to program Spark.
With Spark 2.0, SparkSession can access all of the aforementioned Spark functionality through a single, unified point of entry. In other words, SparkSession encapsulates SparkContext.
Let's say multiple users access the same notebook, which shares a single SparkContext, and the requirement is to give each user an isolated environment on top of that shared context. Prior to 2.0, the only solution was to create multiple SparkContexts, i.e. one SparkContext per isolated environment or user, which is an expensive operation (and only a single SparkContext may exist per JVM). With the introduction of SparkSession, this issue has been addressed.
I spark-submit 10 ETLs, will these 10 jobs share the same SparkContext? Does it matter if I do this in cluster/client mode? To the best of my understanding, the SparkContext lives in the driver so I assume the above would result in one SparkContext shared by 10 SparkSessions,
If you submit 10 ETL jobs via spark-submit, whether in cluster or client mode, they are all different applications and each has its own SparkContext and SparkSession. In native Spark you can't share objects between different applications; if you want to share objects you have to use shared contexts (e.g. spark-jobserver). Multiple options are available, such as Apache Livy and Apache Ignite.
There is one SparkContext per Spark application.
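As a quick local-mode illustration (the exception text is paraphrased from memory and may vary slightly between versions), trying to stand up a second independent SparkContext in the same JVM is rejected:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("first-app"))
// A second, independent context in the same JVM is not allowed
// (in Spark 1.x/2.x this could be bypassed with spark.driver.allowMultipleContexts=true, which is discouraged):
// val sc2 = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("second-app"))
// => fails with roughly: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243)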
The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed?
A SparkSession is created (or reused) via its builder, for example:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// master and appName come from your own application's configuration
private val conf: SparkConf = new SparkConf()
  .setMaster(master)
  .setAppName(appName)
  .set("spark.ui.enabled", "false")

val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
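A side note on getOrCreate: calling it again in the same JVM while a session already exists hands back that existing session rather than building another one. A tiny sketch (ssAgain is just an illustrative name):

// returns the already-created session instead of creating a new one
val ssAgain = SparkSession.builder().getOrCreate()
assert(ssAgain eq ss)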
If you have an existing Spark session and want to create a new one, use the newSession method on the existing SparkSession.
import org.apache.spark.sql.SparkSession

// `spark` below is the existing SparkSession (for example the `ss` created above)
val newSparkSession1 = spark.newSession()
val newSparkSession2 = spark.newSession()
The newSession method creates a new Spark session with isolated SQL configurations and temporary tables. The new session shares the underlying SparkContext and cached data.
Then you can use these different sessions to submit different jobs/sql queries.
newSparkSession1.sql("<ETL-1>")
newSparkSession2.sql("<ETL-2>")
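For instance, here is a quick way to convince yourself that both sessions ride on one SparkContext while keeping their temporary views separate (the view name is just a placeholder):

// both sessions are backed by the very same SparkContext
assert(newSparkSession1.sparkContext eq newSparkSession2.sparkContext)

// a temp view registered in one session is invisible to the other
newSparkSession1.range(10).createOrReplaceTempView("numbers")
newSparkSession1.sql("SELECT count(*) FROM numbers").show()
// newSparkSession2.sql("SELECT count(*) FROM numbers")  // would fail with "Table or view not found"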
Does it matter if I do this in cluster/client mode?
Client/Cluster mode doesn't matter.
I have around 10 Spark jobs, each of which does some transformation and loads data into a database. The Spark session has to be opened and closed individually for each job, and every initialization consumes time.
Is it possible to create the Spark session only once and reuse it across multiple jobs?
Technically, if you use a single Spark session you will end up with a single Spark application, because you will have to package and run the multiple ETL (Extract, Transform, and Load) steps within a single JAR file.
If you are running those jobs on a production cluster, you are most likely using spark-submit to execute your application JAR, which has to go through the initialization phase every time you submit a job through the Spark master to the workers in client mode.
In general, a long-running Spark session is mostly suitable for prototyping, troubleshooting, and debugging purposes. For example, a single Spark session can be leveraged in spark-shell or in another interactive development environment such as Zeppelin, but not with spark-submit, as far as I know.
All in all, a couple of design/business questions are worth considering here: will merging multiple ETL jobs together produce code that is easy to sustain, manage, and debug? Does it provide the required performance gain? What does the risk/cost analysis look like? And so on.
Hope this helps.
You can submit your job once, in other words do spark-submit once. Inside the code which is submitted you can have 10 calls each doing some transformation and load data into Database.
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder
  .appName("Multiple-jobs")
  .master("<cluster name>")
  .getOrCreate()

def method1(): Unit = {
  // getOrCreate returns the same Spark session created above
  val spark = SparkSession.builder.getOrCreate()
  // work: first transformation and load
}

def method2(): Unit = {
  // again reuses the same session for the second transformation and load
  val spark = SparkSession.builder.getOrCreate()
  // work
}

method1()
method2()
However, if each job is time-consuming, say it takes 10 minutes, then by comparison you wouldn't be spending a lot of time creating separate Spark sessions, so I wouldn't worry about one Spark session per job. I would be worried, though, if a separate Spark session were created per method or per unit test case; that is where I would save on Spark sessions.
Having two separate PySpark applications that instantiate a HiveContext in place of a SQLContext causes one of the two applications to fail with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is enough to reproduce:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)  # replacing HiveContext with SQLContext(sc) avoids the error

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)

time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows up when reading the parquet file. When I replace HiveContext with SQLContext, everything is fine.
Does anyone know why that is?
By default, Hive(Context) uses an embedded Derby database as its metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications, you should configure a standalone metastore. At the moment Hive supports PostgreSQL, MySQL, Oracle, and MS SQL Server as backends. The configuration details depend on the backend and mode (local / remote), but generally speaking you'll need:
a running RDBMS server
a metastore database created using provided scripts
a proper Hive configuration
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
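As a rough sketch only (host, database name, and credentials below are placeholders, and the matching JDBC driver JAR has to be on the classpath), a minimal hive-site.xml for a MySQL-backed metastore looks roughly like this:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive-password</value>
  </property>
</configuration>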
Theoretically, it should also be possible to create separate Derby metastores with a proper configuration (see the Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.
For development, you can start the applications in different working directories. This will create a separate metastore_db for each application and avoid the issue of multiple active users. Providing a separate Hive configuration should work as well, but it is less useful in development:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory
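For example, giving each application its own Derby metastore comes down to pointing the connection URL at an application-specific location in that application's hive-site.xml (the path below is a placeholder):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/path/to/app1/metastore_db;create=true</value>
</property>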
I have an RMI cluster. Each RMI server has a Spark context.
Is there any way to share an RDD between different Spark contexts?
As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context that was used to create it (a SparkContext in the case of an RDD, a SQLContext in the case of a DataFrame or Dataset). If you want to share objects between applications, you have to use shared contexts (see for example spark-jobserver, Livy, or Apache Zeppelin). Since an RDD or DataFrame is just a small local object, there is really not much to share.
Sharing data is a completely different problem. You can use a specialized in-memory cache (Apache Ignite) or a distributed in-memory file system (like Alluxio, formerly Tachyon) to minimize the latency when switching between applications, but you cannot really avoid it.
No, an RDD is tied to a single SparkContext. The general idea is that you have a Spark cluster and one driver program that tells the cluster what to do. This driver would have the SparkContext and kick off operations on the RDDs.
If you want to just move an RDD from one driver program to another, the solution is to write it to disk (S3/HDFS/...) in the first driver and load it from disk in the other driver.
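A minimal sketch of that hand-off, using the Spark 2.x SparkSession API for brevity (the HDFS path is a placeholder, and the two halves belong to two separate driver programs, each with its own context):

import org.apache.spark.sql.SparkSession

// driver A: persist the RDD to shared storage
val sparkA = SparkSession.builder().appName("producer").getOrCreate()
sparkA.sparkContext.parallelize(1 to 1000).saveAsTextFile("hdfs:///shared/handoff")

// driver B (a different application, hence a different SparkContext): load it back
val sparkB = SparkSession.builder().appName("consumer").getOrCreate()
val reloaded = sparkB.sparkContext.textFile("hdfs:///shared/handoff")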
You can't do it natively. In my understanding, an RDD is not data, but a way to create data via transformations/filters from the original data.
Another idea is to share the final data instead. That is, you would store the RDD's output in a data store, such as:
- HDFS (a Parquet file, etc.)
- Elasticsearch
- Apache Ignite (in-memory)
I think you will love Apache Ignite: https://ignite.apache.org/features/igniterdd.html
Apache Ignite provides an implementation of the Spark RDD abstraction which allows to easily share state in memory across multiple Spark jobs, either within the same application or between different Spark applications.

IgniteRDD is implemented as a view over a distributed Ignite cache, which may be deployed either within the Spark job executing process, or on a Spark worker, or in its own cluster.
(I let you dig their documentation to find what you are looking for.)
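That said, here is a rough sketch of what the IgniteRDD hand-off tends to look like with the ignite-spark module; the cache name and configuration are placeholders, and the exact signatures are an assumption to verify against the Ignite documentation for your version:

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext

// `sc` is this application's SparkContext
val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())
val sharedRdd = igniteContext.fromCache[Int, Int]("sharedCache")

// application A writes key/value pairs into the shared Ignite cache
sharedRdd.savePairs(sc.parallelize(1 to 1000).map(i => (i, i * i)))

// application B, with its own SparkContext, can build an IgniteContext the same way,
// call fromCache[Int, Int]("sharedCache"), and read the very same data back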