Creating many, short-living SparkSessions - apache-spark

I've got an application that orchestrates batch job executions and I want to create a SparkSession per job execution - especially in order to get a clean separation of registered temp views, functions etc.
So, this would lead to thousands of SparkSessions per day, each living only for the duration of a job (from a few minutes up to several hours). Is there any argument against doing this?
I am aware that there is only one SparkContext per JVM. I also know that a SparkContext performs some JVM-global caching, but what exactly does this mean for this scenario? What, for example, is cached in a SparkContext, and what would happen if many Spark jobs are executed using those sessions?

This shows how multiple sessions can be built with different configurations.
Use
SparkSession.clearActiveSession();
SparkSession.clearDefaultSession();
to clear the sessions.
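import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
// First session
SparkSession spark1 = SparkSession.builder()
    .master("local[*]")
    .appName("app1")
    .getOrCreate();
Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
df.show();
// Clear the active and default sessions so that the next getOrCreate() builds a new session
SparkSession.clearActiveSession();
SparkSession.clearDefaultSession();
// Second session; it reuses the SparkContext that is already running in this JVM
SparkSession spark2 = SparkSession.builder()
    .master("local[*]")
    .appName("app2")
    .getOrCreate();
Dataset<Row> df2 = spark2.read().format("csv").load("data/file2.csv");
df2.show();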
As for your questions:
The SparkContext keeps RDDs in memory for quicker processing.
If there is a lot of data, the saved tables or RDDs are moved to disk.
A session can access a table if it was saved as a view at some point.
It is better to do multiple spark-submits for your jobs, each with a unique ID, instead of having different configs.

Related

Starting a Spark Session without executors

I have a use case where I need to use some of Spark's APIs without actually performing any data processing. For example, I want to read the schema of a Hive table with spark.table(table_name).schema.
I want the process to be fast and lightweight. Specifically, I want to avoid the relatively long wait time to get the resources when starting. Is there a way to get a limited Spark Session with just the driver JVM and no executors at all?
The best I managed is this, but I wanted to see if I can make it even lighter:
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .enableHiveSupport()
    .master("local[1]")
    .config("spark.executor.instances", "1")
    .config("spark.executor.cores", "1")
    .config("spark.executor.memory", "450m")
    .config("spark.executor.memoryOverhead", "0")
    .config("spark.shuffle.service.enabled", "false")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.ui.enabled", "false")
    .getOrCreate()  # without this call, spark is only a Builder, not a session
)
Just to clear up your line of thought:
In local mode, the driver and the executors live in a single JVM. There are no real executors; there are just N cores for the Spark application to use.
So you are fine with local[1], and you do not need to set these executor parameters.
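A minimal PySpark sketch of the advice above, assuming pyspark is installed and a Hive metastore is reachable. The helper names `lightweight_local_options` and `build_lightweight_session` and the app name are made up for illustration; the executor settings from the question are simply left out.

```python
def lightweight_local_options(cores=1):
    """Options for a driver-only session (hypothetical helper).

    Executor settings are omitted on purpose: in local mode everything
    runs inside the driver JVM.
    """
    return {
        "spark.master": f"local[{cores}]",
        "spark.ui.enabled": "false",  # skip starting the web UI
    }


def build_lightweight_session(app_name="schema-lookup"):
    """Build (or reuse) a local session with the options above."""
    # Imported here so the options helper can be used without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in lightweight_local_options().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With this in place, the schema lookup from the question would read `build_lightweight_session().table(table_name).schema`.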

Do Spark SQL executions use the thread-local job group?

From my findings, running multiple Spark SQL queries with different job groups does not put them in the specified groups.
https://issues.apache.org/jira/browse/SPARK-29340
Creating a new thread-local job group works for Spark DataFrame jobs but not for Spark SQL. Is there a way to put all thread-local Spark SQL executions in a separate job group?
val sparkThreadLocal: SparkSession = DataCurator.spark.newSession()
sparkThreadLocal.sparkContext.setJobGroup("<id>", "<description>")
OR
sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", "<description>")
sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", "<id>")
Solved! It was an issue with using Scala parallel iteration, which uses thread pools.

How many Spark Session to create?

We are building a data ingestion framework in pyspark.
The first step is to get/create a sparksession with our app name. The structure of dataLoader.py is outlined below.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('POC') \
    .enableHiveSupport() \
    .getOrCreate()
# create data frame from file
# process file
If I have to execute this dataLoader.py concurrently to load different files, would sharing the same Spark session cause an issue?
Do I have to create a separate Spark session for every ingestion?
No, you don't create multiple Spark sessions. A Spark session should be created only once per Spark application. Spark doesn't support this, and your job might fail if you use multiple Spark sessions in the same Spark job. Here is SPARK-2243, where Spark closed the ticket as Won't Fix.
If you want to load different files using dataLoader.py, there are two options:
1. Load and process the files sequentially. Here you load one file at a time, save it to a dataframe and process that dataframe.
2. Create a separate dataLoader.py script for each file and run the Spark jobs in parallel. Here each Spark job gets its own SparkSession.
Yet another option is to create a Spark session once, share it among several threads and enable FAIR job scheduling. Each of the threads would execute a separate Spark job, i.e. call collect or another action on a data frame. The optimal number of threads depends on the complexity of your jobs and the size of the cluster. If there are too few threads, the cluster can be underloaded and waste its resources. If there are too many threads, the cluster will be saturated and some jobs will sit idle, waiting for executors to free up.
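The shared-session approach could look roughly like the sketch below. It assumes pyspark is installed; the function names, the app name and the choice of counting lines per file are illustrative only. `spark.scheduler.mode` is a real Spark setting, but it only takes effect if it is in place before the SparkContext starts.

```python
from concurrent.futures import ThreadPoolExecutor


def run_actions_concurrently(actions, max_threads=4):
    """Run zero-argument callables on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(action) for action in actions]
        return [future.result() for future in futures]


def count_lines_concurrently(paths, max_threads=4):
    """Count the lines of each file as a separate Spark job on one shared session."""
    # Imported here so the thread-pool helper can be used without pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("fair-scheduling-sketch")        # hypothetical app name
        .config("spark.scheduler.mode", "FAIR")   # ignored if a context already runs
        .getOrCreate()
    )
    # Each callable triggers one action, i.e. one Spark job, from its own thread.
    actions = [lambda p=path: spark.read.text(p).count() for path in paths]
    return run_actions_concurrently(actions, max_threads)
```

`max_threads` is the knob discussed above: too low leaves the cluster idle, too high makes jobs queue for executors.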
Each Spark job is independent and there can only be one instance of SparkSession (and SparkContext) per JVM. You won't be able to create multiple session instances.
You want to create a new Spark application for every file, which is certainly possible, as each Spark application would have one corresponding Spark session; it is usually not the recommended way, though. You can load multiple files using the same Spark session object, which is usually preferred.

How many SparkSessions can a single application have?

I have found that as Spark runs and tables grow in size (through joins), the Spark executors eventually run out of memory and the entire system crashes. Even if I try to write temporary results to Hive tables (on HDFS), the system still doesn't free much memory, and my entire system crashes after about 130 joins.
However, through experimentation, I realized that if I break the problem into smaller pieces, write temporary results to hive tables, and Stop/Start the Spark session (and spark context), then the system's resources are freed. I was able to join over 1,000 columns using this approach.
But I can't find any documentation to understand if this is considered a good practice or not (I know you should not acquire multiple sessions at once). Most systems acquire the session in the beginning and close it in the end. I could also break the application into smaller ones, and use a driver like Oozie to schedule these smaller applications on Yarn. But this approach would start and stop the JVM at each stage, which seems a bit heavy-weight.
So my question: is it bad practice to continually start/stop the spark session to free system resources during the run of a single spark application?
But can you elaborate on what you mean by a single SparkContext on a single JVM? I was able to call sparkSession.sparkContext().stop(), and also stop the SparkSession. I then created a new SparkSession and used a new SparkContext. No error was thrown.
I was also able to use this on the JavaSparkPi without any problems.
I have tested this in yarn-client and a local spark install.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
TL;DR You can have as many SparkSessions as needed.
You can have one and only one SparkContext on a single JVM, but the number of SparkSessions is pretty much unbounded.
But can you elaborate on what you mean by a single SparkContext on a single JVM?
It means that at any given time in the lifecycle of a Spark application there can be one and only one driver, which in turn means that there is one and only one SparkContext available on that JVM.
The driver of a Spark application is where the SparkContext lives (or rather the opposite: the SparkContext defines the driver; the distinction is pretty blurry).
You can only have one SparkContext at a time. You can start and stop it on demand as many times as you want, but I remember an issue about it that said you should not close the SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).
In other words, have a single SparkContext for the entire lifetime of your Spark application.
There was a similar question, "What's the difference between SparkSession.sql vs Dataset.sqlContext.sql?", about multiple SparkSessions that can shed more light on why you'd want to have two or more sessions.
I was able to call sparkSession.sparkContext().stop(), and also stop the SparkSession.
So?! How does this contradict what I said?! You stopped the only SparkContext available on the JVM. Not a big deal. You could, but that's just one part of "you can only have one and only one SparkContext on a single JVM available", isn't it?
SparkSession is a mere wrapper around SparkContext to offer Spark SQL's structured/SQL features on top of Spark Core's RDDs.
From the point of view of a Spark SQL developer, the purpose of a SparkSession is to be a namespace for query entities like tables, views or functions that your queries use (as DataFrames, Datasets or SQL) and for Spark properties (which could have different values per SparkSession).
If you'd like to have the same (temporary) table name used for different Datasets, creating two SparkSessions would be what I'd consider the recommended way.
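The namespace point can be sketched in PySpark as follows. The function name and the view name `events` are made up for illustration; `spark` is assumed to be an existing SparkSession, and both child sessions share its SparkContext.

```python
def register_same_view_in_two_sessions(spark, view_name="events"):
    """Register one temp view name in two sessions and return both row counts."""
    first = spark.newSession()   # own temp views, functions and SQL conf
    second = spark.newSession()  # same SparkContext, separate namespace

    first.range(3).createOrReplaceTempView(view_name)
    second.range(10).createOrReplaceTempView(view_name)

    # Each session resolves the name against its own catalog of temp views.
    return first.table(view_name).count(), second.table(view_name).count()
```

Since local temp views are scoped to a session, the two counts are expected to differ (3 and 10 here) even though the view name is the same.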
I've just worked on an example to showcase how whole-stage codegen works in Spark SQL and have created the following that simply turns the feature off.
// both where and select operators support whole-stage codegen
// the plan tree (with the operators and expressions) meets the requirements
// That's why the plan has WholeStageCodegenExec inserted
// You can see stars (*) in the output of explain
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
scala> q.explain
== Physical Plan ==
*Project [_2#89 AS c0#93]
+- *Filter (_1#88 = 0)
   +- LocalTableScan [_1#88, _2#89, _3#90]
// Let's break the requirement of having up to spark.sql.codegen.maxFields
// I'm creating a brand new SparkSession with one property changed
val newSpark = spark.newSession()
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_MAX_NUM_FIELDS
newSpark.sessionState.conf.setConf(WHOLESTAGE_MAX_NUM_FIELDS, 2)
scala> println(newSpark.sessionState.conf.wholeStageMaxNumFields)
2
// Let's see what the initial value is
// Note that I use spark value (not newSpark)
scala> println(spark.sessionState.conf.wholeStageMaxNumFields)
100
import newSpark.implicits._
// the same query as above but created in SparkSession with WHOLESTAGE_MAX_NUM_FIELDS as 2
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
// Note that there are no stars in the output of explain
// No WholeStageCodegenExec operator in the plan => whole-stage codegen disabled
scala> q.explain
== Physical Plan ==
Project [_2#122 AS c0#126]
+- Filter (_1#121 = 0)
   +- LocalTableScan [_1#121, _2#122, _3#123]
I then created a new SparkSession and used a new SparkContext. No error was thrown.
Again, how does this contradict what I said about a single SparkContext being available? I'm curious.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
You can no longer use it to run Spark jobs (to process large and distributed datasets), which is pretty much exactly the reason why you use Spark in the first place, isn't it?
Try the following:
1. Stop the SparkContext
2. Execute any processing using Spark Core's RDD or Spark SQL's Dataset APIs
An exception? Right! Remember that you closed the "doors" to Spark, so how could you have expected to be inside?! :)
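The two steps above could be tried in PySpark with a sketch like this one. The function name is made up, `spark` is assumed to be a running SparkSession, and the exact exception type is deliberately not assumed, since it can differ between Spark versions and language bindings.

```python
def action_fails_after_stop(spark):
    """Stop the session, then report whether a subsequent action raises."""
    spark.stop()  # also stops the underlying SparkContext
    try:
        spark.range(1).count()  # any action would do here
    except Exception:
        # Expected path: the stopped context can no longer run jobs.
        return True
    return False
```

Note that this only concerns the stopped instance; building a fresh session afterwards, as described in the question, is a separate matter.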

Why can't we create an RDD using Spark session

We see that,
Spark context available as 'sc'.
Spark session available as 'spark'.
I read spark session includes spark context, streaming context, hive context ... If so, then why are we not able to create an rdd by using a spark session instead of a spark context.
scala> val a = sc.textFile("Sample.txt")
17/02/17 16:16:14 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
a: org.apache.spark.rdd.RDD[String] = Sample.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val a = spark.textFile("Sample.txt")
<console>:23: error: value textFile is not a member of org.apache.spark.sql.SparkSession
val a = spark.textFile("Sample.txt")
As shown above, sc.textFile succeeds in creating an RDD but not spark.textFile.
In Spark 2+, Spark Context is available via Spark Session, so all you need to do is:
spark.sparkContext().textFile(yourFileOrURL)
see the documentation on this access method here.
Note that in PySpark this would become:
spark.sparkContext.textFile(yourFileOrURL)
see the documentation here.
In earlier versions of Spark, the SparkContext was the entry point. As the RDD was the main API, it was created and manipulated using the context's APIs.
For every other API, we needed a different context: StreamingContext for streaming, SQLContext for SQL and HiveContext for Hive.
But as the Dataset and DataFrame APIs are becoming the new standard APIs, Spark needs an entry point built for them. So in Spark 2.0, Spark has a new entry point for the Dataset and DataFrame APIs called SparkSession.
SparkSession is essentially a combination of SQLContext, HiveContext and, in the future, StreamingContext.
All the APIs available on those contexts are available on SparkSession as well. A SparkSession internally has a SparkContext for the actual computation.
The SparkContext still contains the methods it had in previous versions.
The methods of SparkSession can be found here.
It can be created in the following way:
val a = spark.read.text("wc.txt")
This will create a DataFrame. If you want to convert it to an RDD, then use:
a.rdd
Please refer to the link below on the Dataset API:
http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Dataset.html
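Both routes mentioned in the answers above, going through the session's SparkContext and converting a DataFrame with `.rdd`, can be put side by side in a PySpark sketch. The function name is made up, and `spark` is assumed to be an existing SparkSession.

```python
def lines_as_rdds(spark, path):
    """Return the lines of a text file as RDDs obtained in two ways."""
    # Route 1: the SparkContext wrapped by the session.
    via_context = spark.sparkContext.textFile(path)

    # Route 2: read a DataFrame with a single "value" column, then drop
    # down to an RDD of Row objects and unwrap each row to a plain string.
    via_reader = spark.read.text(path).rdd.map(lambda row: row.value)

    return via_context, via_reader
```

Both results are RDDs of strings, so either one can be passed to code that expects the output of `sc.textFile`.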

Resources