Does Spark Sql executions use thread local jobgroup? - apache-spark

From my findings running multiple sparksqls with different job groups does not put them in the specified groups.
https://issues.apache.org/jira/browse/SPARK-29340
Creating new threadlocal jobgroup works for spark dataframe jobs but not for sparksql. Is there a way to put all threadlocal spark sql executions in a separate jobgroup?
val sparkThreadLocal: SparkSession = DataCurator.spark.newSession()
sparkThreadLocal.sparkContext.setJobGroup("<id>", "<description>")
OR
sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", "<id>")
sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", "<description>")

Solved! It was an issue with using scala parallel iteration, which uses threadpools.

Related

How to share a global spark session?

Actually I am working on a project which includes a workflow which consists of multiple tasks and single task consists of multiple components.
for eg.
In join , we need 4 components. 2 for input ( consider two table join) , 1 for join logic and 1 for output ( writing back to hdfs ).
This is one task. , similarly "sort" can be another task.
suppose a workflow with these two tasks and which are linked i.e
after performing join , we are using output of join in our sorting task.
But "join" and "sort" invoke separate "spark sessions".
So the flow is like , one spark session is created for join using spark submit , output is saved in hdfs and current spark session is closed. For the sort another session is created using spark submit and output stored at hdfs by join task is fetched for sorting.
But the problem is there is a overhead of fetching the data from hdfs.
So is there any way I can share a session for different tasks between two spark-submit. So that i will not loose my result dataframe from join and which can be directly used in my next spark-submit of sorting.
So basically I have multiple spark-submit associated with different task . But I want to retain my result in dataframe in memory so that I need not to persist it and can be used in another linked task ( spark-submit)
The spark session builder has a function to get or create the SparkSession.
val spark = SparkSession.builder()
.master("local[*]")
.appName("AzuleneTest")
.getOrCreate()
But reading your description you may be better off passing dataframes between between the functions (which hold a reference to a spark session). This avoids creating multiple instances of spark. For example
//This is run within a single spark application
val df1 = spark.read.parquet("s3://bucket/1/")
val df2 = spark.read.parquet("s3://bucket/1/")
val df3 = df1.join(df2,"id")
df3.write.parquet("s3://bucket/3/")

How to start multiple streaming queries in a single Spark application?

I have built few Spark Structured Streaming queries to run on EMR, they are long running queries, and need to run at all times, since they are all ETL type queries, when I submit a job to YARN cluster on EMR, I can submit a single spark application. So that spark application should have multiple streaming queries.
I am confused on how to build/start multiple streaming queries within same submit programmatically.
For ex: I have this code:
case class SparkJobs(prop: Properties) extends Serializable {
def run() = {
Type1SparkJobBuilder(prop).build().awaitTermination()
Type1SparkJobBuilder(prop).build().awaitTermination()
}
}
I fire this in my main class with SparkJobs(new Properties()).run()
When I see in the spark history server, only the first spark streaming job (Type1SparkJob) is running.
What is the recommended way to fire multiple streaming queries within same spark submit programatically, I could not find proper documentation either.
Since you're calling awaitTermination on the first query it's going to block until it completes before starting the second query. So you want to kick off both queries, but then use StreamingQueryManager.awaitAnyTermination.
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
In addition to the above, by default Spark uses the FIFO scheduler. Which means the first query gets all resources in the cluster while it's executing. Since you're trying to run multiple queries concurrently you should switch to the FAIR scheduler
If you have some queries that should have more resources than the others then you can also tune the individual scheduler pools.
val query1=ds.writeSteam.{...}.start()
val query2=ds.writeSteam.{...}.start()
val query3=ds.writeSteam.{...}.start()
query3.awaitTermination()
AwaitTermination() will block your process until finish, which will never happen in a streaming app, call it on your last query should fix your problem

How many Spark Session to create?

We are building a data ingestion framework in pyspark.
The first step is to get/create a sparksession with our app name. The structure of dataLoader.py is outlined below.
spark = SparkSession \
.builder \
.appName('POC') \
.enableHiveSupport() \
.getOrCreate()
#create data frame from file
#process file
If i have to execute this dataLoader.py concurrently for loading different files, would having the same spark session cause an issue?
Do I have to create a separate spark session for every ingestion?
No, you don't create multiple spark session. Spark session should be created only once per spark application. Spark doesn't support this and your job might will fail if you use multiple spark session in the same spark job. Here is the SPARK-2243 where spark has closed the ticket saying it won't fix it.
If you want to load different files using the dataLoader.pythere are 2 options
Load and process files sequentially. Here you load one file at a time; save that to a dataframe and process that dataframe.
Create different dataLoader.py script for different files and run each spark job in parallel. Here each spark job gets its own sparkSession.
Yet another option is to create a Spark session once, share it among several threads and enable FAIR job scheduling. Each of the threads would execute a separate spark job, i.e. calling collect or other action on a data frame. The optimal number of threads depends on complexity of your job and the size of the cluster. If there are too few jobs, the cluster can be underloaded and wasting its resources. If there are too many threads, the cluster will be saturated and some jobs will be sitting idle and waiting for executors to free up.
Each spark job is independent and there can only be one instance of SparkSession ( and SparkContext ) per JVM. You won't be able to create multiple session instances.
You want to create a new spark application for every file which is certainly possible as each spark application would have 1 corresponding spark session, it is not the recommended way though (usually).You can load multiple files using the same spark session object which is preferred (usually).

How many SparkSessions can a single application have?

I have found that as Spark runs, and tables grow in size (through Joins) that the spark executors will eventually run out of memory and the entire system crashes. Even if I try to write temporary results to Hive tables (on HDFS), the system still doesn't free much memory, and my entire system crashes after about 130 joins.
However, through experimentation, I realized that if I break the problem into smaller pieces, write temporary results to hive tables, and Stop/Start the Spark session (and spark context), then the system's resources are freed. I was able to join over 1,000 columns using this approach.
But I can't find any documentation to understand if this is considered a good practice or not (I know you should not acquire multiple sessions at once). Most systems acquire the session in the beginning and close it in the end. I could also break the application into smaller ones, and use a driver like Oozie to schedule these smaller applications on Yarn. But this approach would start and stop the JVM at each stage, which seems a bit heavy-weight.
So my question: is it bad practice to continually start/stop the spark session to free system resources during the run of a single spark application?
But can you elaborate on what you mean by a single SparkContext on a single JVM? I was able call sparkSession.sparkContext().stop(), and also stop the SparkSession. I then created a new SparkSession and used a new sparkContext. No error was thrown.
I was also able to use this on the JavaSparkPi without any problems.
I have tested this in yarn-client and a local spark install.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
TL;DR You can have as many SparkSessions as needed.
You can have one and only one SparkContext on a single JVM, but the number of SparkSessions is pretty much unbounded.
But can you elaborate on what you mean by a single SparkContext on a single JVM?
It means that at any given time in the lifecycle of a Spark application the driver can only be one and only one which in turn means that there's one and only one SparkContext on that JVM available.
The driver of a Spark application is where the SparkContext lives (or it's the opposite rather where SparkContext defines the driver -- the distinction is pretty much blurry).
You can only have one SparkContext at one time. Although you can start and stop it on demand as many times you want, but I remember an issue about it that said you should not close SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).
In other words, have a single SparkContext for the entire lifetime of your Spark application.
There was a similar question What's the difference between SparkSession.sql vs Dataset.sqlContext.sql? about multiple SparkSessions that can shed more light on why you'd want to have two or more sessions.
I was able call sparkSession.sparkContext().stop(), and also stop the SparkSession.
So?! How does this contradict what I said?! You stopped the only SparkContext available on the JVM. Not a big deal. You could, but that's just one part of "you can only have one and only one SparkContext on a single JVM available", isn't it?
SparkSession is a mere wrapper around SparkContext to offer Spark SQL's structured/SQL features on top of Spark Core's RDDs.
From the point of Spark SQL developer, the purpose of a SparkSession is to be a namespace for query entities like tables, views or functions that your queries use (as DataFrames, Datasets or SQL) and Spark properties (that could have different values per SparkSession).
If you'd like to have the same (temporary) table name used for different Datasets, creating two SparkSessions would be what I'd consider the recommended way.
I've just worked on an example to showcase how whole-stage codegen works in Spark SQL and have created the following that simply turns the feature off.
// both where and select operators support whole-stage codegen
// the plan tree (with the operators and expressions) meets the requirements
// That's why the plan has WholeStageCodegenExec inserted
// You can see stars (*) in the output of explain
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
scala> q.explain
== Physical Plan ==
*Project [_2#89 AS c0#93]
+- *Filter (_1#88 = 0)
+- LocalTableScan [_1#88, _2#89, _3#90]
// Let's break the requirement of having up to spark.sql.codegen.maxFields
// I'm creating a brand new SparkSession with one property changed
val newSpark = spark.newSession()
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_MAX_NUM_FIELDS
newSpark.sessionState.conf.setConf(WHOLESTAGE_MAX_NUM_FIELDS, 2)
scala> println(newSpark.sessionState.conf.wholeStageMaxNumFields)
2
// Let's see what's the initial value is
// Note that I use spark value (not newSpark)
scala> println(spark.sessionState.conf.wholeStageMaxNumFields)
100
import newSpark.implicits._
// the same query as above but created in SparkSession with WHOLESTAGE_MAX_NUM_FIELDS as 2
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
// Note that there are no stars in the output of explain
// No WholeStageCodegenExec operator in the plan => whole-stage codegen disabled
scala> q.explain
== Physical Plan ==
Project [_2#122 AS c0#126]
+- Filter (_1#121 = 0)
+- LocalTableScan [_1#121, _2#122, _3#123]
I then created a new SparkSession and used a new SparkContext. No error was thrown.
Again, how does this contradict what I said about a single SparkContext being available? I'm curious.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
You can no longer use it to run Spark jobs (to process large and distributed datasets) which is pretty much exactly the reason why you use Spark in the first place, doesn't it?
Try the following:
Stop SparkContext
Execute any processing using Spark Core's RDD or Spark SQL's Dataset APIs
An exception? Right! Remember that you close the "doors" to Spark so how could you have expected to be inside?! :)

Creating many, short-living SparkSessions

I've got an application that orchestrates batch job executions and I want to create a SparkSession per job execution - especially in order to get a clean separation of registered temp views, functions etc.
So, this would lead to thousands of SparkSessions per day, that will only live for the duration of a job (from a few minutes up to a several hours). Is there any argument to not do this ?
I am aware of the fact, that there is only one SparkContext per JVM. I also know that a SparkContext performs some JVM global caching, but what exactly does this mean for this scenario ? What is e.g. cached in a SparkContext and what would happen if there are many spark jobs executed using those sessions ?
This shows how multiple sessions can be build with different configures
Use
spark1.clearActiveSession();
spark1.clearDefaultSession();
To clear the sessions.
SparkSession spark1 = SparkSession.builder()
.master("local[*]")
.appName("app1")
.getOrCreate();
Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
df.show();
spark1.clearActiveSession();
spark1.clearDefaultSession();
SparkSession spark2 = SparkSession.builder()
.master("local[*]")
.appName("app2")
.getOrCreate();
Dataset<Row> df2 = spark1.read().format("csv").load("data/file2.csv");
df2.show();
For your questions.
Spark context save the rdds in memory for quicker processing.
If there is lot of data . The save tables or rdds are moved to the hdd .
A session can access the tables if it saved as a view at any point.
It is better to do multiple spark-submits for your jobs with unique id instead of having different configs.

Resources