Reuse Spark session across multiple Spark jobs - apache-spark

I have around 10 Spark jobs, each of which does some transformation and loads data into a database. The Spark session has to be opened and closed individually for each job, and the initialization consumes time on every run.
Is it possible to create the Spark session only once and reuse it across multiple jobs?

Technically, if you use a single Spark session you will end up with a single Spark application, because you will have to package and run your multiple ETL (Extract, Transform & Load) jobs within a single JAR file.
If you are running those jobs on a production cluster, most likely you are using spark-submit to execute your application JAR, which has to go through the initialization phase every time you submit a job through the Spark master to the workers in client mode.
In general, a long-running Spark session is mostly suitable for prototyping, troubleshooting and debugging. For example, a single Spark session can be leveraged in spark-shell or any other interactive development environment, like Zeppelin, but not with spark-submit as far as I know.
All in all, a couple of design/business questions are worth considering here: will merging multiple ETL jobs together produce code that is easy to sustain, manage and debug? Does it provide the required performance gain? What does the risk/cost analysis look like? And so on.
Hope this helps.

You can submit your job once, in other words do spark-submit once. Inside the code that is submitted you can have 10 method calls, each doing some transformation and loading data into the database, for example:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .appName("Multiple-jobs")
  .master("<cluster name>")
  .getOrCreate()
method1()
method2()
def method1(): Unit = {
  // getOrCreate returns the same Spark session created outside the method.
  val spark = SparkSession.builder.getOrCreate()
  // transformation and load for job 1 goes here
}
def method2(): Unit = {
  // analogous transformation and load for job 2
}
However, if a job is time consuming, say it takes 10 minutes, then by comparison you wouldn't be spending much time creating separate Spark sessions, so I wouldn't worry about one Spark session per job. I would be worried if a separate Spark session were created per method or per unit test case; that is where I would save on Spark sessions.
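As a minimal sketch of that last point (the trait and names here are illustrative, not from the answer above), a test suite can share one local session via getOrCreate instead of building a new one per test case:
import org.apache.spark.sql.SparkSession
// Mix this into test suites so they all reuse a single local SparkSession.
trait SharedSparkSession {
  lazy val spark: SparkSession = SparkSession.builder
    .appName("shared-test-session")
    .master("local[*]")
    .getOrCreate()
}
Because getOrCreate returns the already-active session when one exists in the JVM, every suite mixing in this trait ends up talking to the same underlying session.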

Related

Spark context is created every time a batch job is executed in yarn

I am wondering, is there any way to create the Spark context once in the YARN cluster so that incoming jobs reuse that context? Context creation takes 20s, or sometimes more, in my cluster. I am using PySpark for scripting and Livy to submit jobs.
No, you can't just have a standing SparkContext running in YARN. Another idea is to run in client mode, where the client has its own SparkContext (this is the method used by tools like Apache Zeppelin and the spark-shell).
An option would be to use Apache Livy. Livy is an additional server in your YARN cluster that provides an interface for clients that want to run Spark jobs on the cluster. One of Livy's features is that you can have long-running Spark contexts that can be used for multiple Spark jobs, by multiple clients.
If the client is written in Scala or Java it is possible to use a programmatic API:
LivyClient client = new LivyClientBuilder()....build();
Object result = client.submit(new SparkJob(sparkParameters)).get();
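Fleshing that snippet out, here is a minimal Scala sketch of the programmatic client; the Livy endpoint, jar path, input path, and the CountJob class are all placeholders, not part of the original answer:
import java.io.File
import java.net.URI
import org.apache.livy.{Job, JobContext, LivyClient, LivyClientBuilder}
// Hypothetical job that reuses the long-running SparkContext managed by Livy.
class CountJob(path: String) extends Job[java.lang.Long] {
  override def call(jc: JobContext): java.lang.Long =
    java.lang.Long.valueOf(jc.sc().textFile(path).count())
}
object LivyClientExample {
  def main(args: Array[String]): Unit = {
    val client: LivyClient = new LivyClientBuilder()
      .setURI(new URI("http://livy-host:8998")) // placeholder Livy endpoint
      .build()
    try {
      // Ship the jar containing CountJob to the shared context once...
      client.uploadJar(new File("/path/to/my-jobs.jar")).get()
      // ...then submit as many jobs as needed against the same context.
      val lines = client.submit(new CountJob("hdfs:///data/input.txt")).get()
      println(s"Line count: $lines")
    } finally {
      client.stop(true)
    }
  }
}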
All other clients can use a REST API.

Run both batch and real time jobs on Spark with jobserver

I have a Spark job that runs every day as part of a pipeline and performs simple batch processing - let's say, adding a column to a DataFrame containing another column's value squared (old DF: x; new DF: x, x^2).
I also have a front-end app that consumes these 2 columns.
I want to allow my users to edit x and get the answer from the same code base.
Since the batch job is already written in Spark, I looked for a way to achieve that against my Spark cluster and ran into Spark Jobserver, which I thought might help here.
My questions:
Can Spark Jobserver support both batch and real-time (single-request) processing?
Can I use the same Jobserver-compatible JAR to run a Spark job on AWS EMR?
I'm open to hearing about other tools that can help with such a use case.
Thanks!
Not sure I understood your scenario fully, but with Spark Jobserver you can configure your batch jobs and pass different parameters to them.
Yes, once you have a Jobserver-compatible JAR, you should be able to use it with Jobserver running against Spark in standalone mode, on YARN, or on EMR. But please take into account that you will need to set up Jobserver on EMR yourself; the open-source documentation seems to be a bit outdated currently.
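To illustrate the "same code base, different parameters" idea, here is a minimal sketch assuming the classic spark.jobserver.SparkJob API; the job name and the input.x parameter are hypothetical:
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}
import scala.util.Try
// Hypothetical Jobserver job: the same JAR serves the nightly batch run and
// ad-hoc "edit x" requests; only the input.x parameter changes per request.
object SquareJob extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    Try(config.getDouble("input.x"))
      .map(_ => SparkJobValid)
      .getOrElse(SparkJobInvalid("Missing or non-numeric input.x"))
  override def runJob(sc: SparkContext, config: Config): Any = {
    val x = config.getDouble("input.x")
    Map("x" -> x, "x_squared" -> x * x) // returned to the caller as the job result
  }
}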

Get Spark session on executor

After deploying a Spark Structured Streaming application, how can I obtain a Spark session on the executor to deploy another job with the same session and the same configuration settings?
You cannot get the Spark session onto an executor if you are running Spark in cluster mode, as the Spark session object cannot be serialized and thus cannot be sent to the executor. It is also against Spark's design principles to do so.
I may be able to help you with this if you can tell me the problem statement.
Technically you can get a Spark session on the executor, no matter which mode you are running in, but it is not really worth the effort. A Spark session is an object holding various internal Spark settings along with the user-defined settings we provide at startup.
The only reason those configuration settings are not available on the executor is that most of them are marked as transient, which means those objects will be sent as null, as it does not make logical sense to send them to the executors, in the same way it does not make sense to send database connection objects from one node to another.
One of the more cumbersome ways to do this would be to get all the configuration settings from the Spark session in your driver, set them in some custom object marked as serializable, and send that to the executor. Your executor environment would also have to match the driver in terms of all the Spark jars/directories and other Spark properties such as SPARK_HOME, which can be hectic if you only realize something is missing at run time. It would be a different Spark session object, but with all the same settings.
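As a rough sketch of that cumbersome approach (class and object names here are illustrative, not an established pattern):
import org.apache.spark.sql.SparkSession
// Copy the driver session's settings into a plain serializable holder and
// ship that to the executors instead of the session itself.
case class SessionSettings(confs: Map[String, String]) extends Serializable
object ShipSettingsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ship-settings").getOrCreate()
    val settings = SessionSettings(spark.conf.getAll) // captured on the driver
    spark.sparkContext.parallelize(1 to 4, 2).foreachPartition { _ =>
      // Executor side: the settings are available here, but any session built
      // from them would be a different SparkSession object, as noted above.
      println(settings.confs.getOrElse("spark.app.name", "unknown"))
    }
  }
}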
The better option would be to run another Spark application with the same settings you provide for your first application, as one Spark session is associated with one Spark application.
It is not possible. I had a similar requirement, so I had to create two separate main classes and one Spark launcher class, in which I was doing sparksession.conf.set(main class name) based on which class I wanted to run. If I wanted to run both, I used Thread.sleep() to let the first one complete before launching the other. I also used SparkListener code to get the status of whether it had completed or not.
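A sketch of that launcher idea, polling SparkLauncher for a final state instead of sleeping for a fixed time; the jar path, master, and class names are placeholders:
import org.apache.spark.launcher.SparkLauncher
object SequentialLauncher {
  def runAndWait(mainClass: String): Unit = {
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass(mainClass)
      .setMaster("yarn")
      .startApplication()
    while (!handle.getState.isFinal) Thread.sleep(1000) // poll until FINISHED/FAILED/KILLED
  }
  def main(args: Array[String]): Unit = {
    runAndWait("com.example.JobOne")
    runAndWait("com.example.JobTwo")
  }
}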
I am aware that this is a late response. Just thought this might be useful.
So, you can use something like the below code snippet in your Spark Structured Streaming application:
For Spark versions <= 3.2.1:
spark_session_for_this_micro_batch = microBatchOutputDF._jdf.sparkSession()
For Spark versions >= 3.3.1:
spark_session_for_this_micro_batch = microBatchOutputDF.sparkSession
Your function can use this Spark session to create DataFrames there.
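For completeness, here is a Scala sketch of the same pattern inside foreachBatch; the rate test source and the file paths are placeholders, and the lookup data is assumed to have a "value" column:
import org.apache.spark.sql.{DataFrame, SparkSession}
object ForeachBatchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-reuse").getOrCreate()
    val stream = spark.readStream.format("rate").load() // test source with a "value" column
    stream.writeStream
      .option("checkpointLocation", "/path/to/checkpoint") // placeholder path
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        val batchSession = batchDF.sparkSession // same session, no rebuild needed
        val lookup = batchSession.read.parquet("/path/to/lookup") // placeholder path
        batchDF.join(lookup, Seq("value"), "left")
          .write.mode("append").parquet(s"/path/to/out/batch=$batchId")
      }
      .start()
      .awaitTermination()
  }
}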
You can refer to this Medium post and the PySpark docs.

Spark Job Server multithreading and dynamic allocation

I had pretty big expectations of Spark Job Server, but found out it critically lacks documentation.
Could you please answer one or all of the following questions:
Does Spark Job Server submit jobs through a Spark session?
Is it possible to run a few jobs in parallel with Spark Job Server? I saw that people faced some troubles, but I haven't seen a solution yet.
Is it possible to run a few jobs in parallel with different CPU, core, and executor configs?
Spark Jobserver does not support SparkSession yet. We will be working on it.
You can either create multiple contexts, or you can configure a context to use the FAIR scheduler.
Use different contexts with different resource configs.
Basically, Jobserver is just a REST API for creating Spark contexts, so you should be able to do whatever you could do with a Spark context.

Spark task deserialization time

I'm running a Spark SQL job, and when looking at the master UI, task deserialization time can take 12 seconds while the compute time is 2 seconds.
Let me give some background:
1- The task is simple: run a query in a PostgreSQL DB and count the results in Spark.
2- The deserialization problem comes when running on a cluster with 2+ workers (one of them the driver) and shipping tasks to the other worker.
3- I have to use the JDBC driver for Postgres, and I run each job with Spark submit.
My questions:
Am I submitting the packaged jars as part of the job every time, and is that the reason for the huge task deserialization time? If so, how can I ship everything to the workers once so that subsequent jobs already have everything they need there?
Is there a way to keep the SparkContext alive between jobs (spark-submit) so that the deserialization time is reduced?
In any case, anything that helps avoid paying the deserialization time every time I run a job on the cluster would be appreciated.
Thanks for your time,
Cheers
As far as I know, YARN supports caching application jars so that they are accessible each time an application runs; please refer to the spark.yarn.jar property.
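For example, a sketch of staging the jars once on HDFS and referencing them from spark-defaults.conf so spark-submit does not re-upload them on every run; the HDFS path is a placeholder, and note that spark.yarn.jar is the Spark 1.x name while Spark 2.x+ uses spark.yarn.jars:
# spark-defaults.conf
spark.yarn.jars    hdfs:///spark/jars/*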
To share a SparkContext between jobs and avoid the overhead of initializing it, there is the spark-jobserver project for this purpose.
