Do I need to/how to clean up Spark Sessions? - apache-spark

I am launching Spark 2.4.6 in a Python Flask web service. I'm running a single Spark Context and I have also enabled FAIR scheduling.
Each time a user makes a request to one of the REST endpoints, I call spark = sparkSession.newSession() and then execute various operations using Spark SQL in this somewhat isolated environment.
My concern is: after 100, or 10,000, or a million requests, with an equal number of new sessions, am I going to run into issues at some point? Is there a way to let my SparkContext know that I don't need an old session anymore and that it can be cleared?
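A minimal sketch of the setup described above (the route, app name and query are placeholder assumptions, not from the original post):

from flask import Flask
from pyspark.sql import SparkSession

app = Flask(__name__)

# One SparkContext for the whole service, created once at startup.
spark = (SparkSession.builder
         .appName("flask-spark-service")          # placeholder app name
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

@app.route("/query")
def handle_request():
    # newSession() shares the SparkContext but gets its own SQLConf,
    # temporary views and UDF registrations, so requests stay isolated.
    session = spark.newSession()
    row = session.sql("SELECT 1 AS answer").collect()[0]
    return {"answer": row["answer"]}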

Related

Running a spark job in local mode inside an Openshift pod

I have a pyspark batch job scheduled on YARN. There is now a requirement to put the logic of the spark job into a web service.
I really don't want there to be 2 copies of the same code, and therefore would like to somehow reuse the spark code inside the service, only replacing the IO parts.
The expected size of the workloads per request is small so I don't want to complicate the service by turning it into a distributed application. I would like instead to run the spark code in local mode inside the service. How do I do that? Is that even a good idea? Are there better alternatives?
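A minimal sketch of how running the Spark logic in local mode inside the service could look (the transform function, app name and Flask wiring are assumptions for illustration, and the request payload is assumed to be a small list of records with a "key" field):

from flask import Flask, request
from pyspark.sql import SparkSession, DataFrame

app = Flask(__name__)

# local[*] keeps everything inside the service's own JVM; no cluster needed.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("batch-logic-as-a-service")     # placeholder app name
         .getOrCreate())

def transform(df: DataFrame) -> DataFrame:
    # Stand-in for the shared logic reused from the batch job; only the IO differs.
    return df.groupBy("key").count()

@app.route("/process", methods=["POST"])
def process():
    rows = request.get_json()                     # small per-request payloads, as stated above
    df = spark.createDataFrame(rows)
    return {"result": [r.asDict() for r in transform(df).collect()]}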

PySpark does not release memory even after its operation has completed

For my current requirement, I am using PySpark with Flask APIs (a Python framework), and I create the Spark session while the Flask API server is starting up. I then use that Spark session for heavyweight computation during each API call. What happens is that on each API request the memory usage increases, even after the operation is done.
I have tried the following after each API call:
1: spark.catalog.clearCache()
2: df.unpersist()
Even so, my memory usage gradually increases.
Can anyone help me get out of this issue? I have tried many different configurations, but the memory is still not released.
unpersist() is by default unpersist(blocking=false), which means it only sets a flag on your DataFrame telling Spark that it can delete the cached data whenever possible. unpersist(blocking=true), on the other hand, blocks your process until the data has actually been removed.
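A minimal sketch of the difference, assuming the spark session from the question and an illustrative cached DataFrame df:

df = spark.range(1_000_000).cache()   # illustrative cached DataFrame
df.count()                            # materialize the cache

df.unpersist(blocking=True)           # block until the cached blocks are actually freed
spark.catalog.clearCache()            # drop anything still cached in this session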

Mesos implementation

I have two Django websites that create a Spark Session to a Cluster which is running on Mesos.
The problem is that whichever Django site starts first creates a framework and takes 100% of the resources permanently; it grabs them and doesn't let them go even when idle.
I am lost on how to make the two frameworks use only the resources they need and have them access the Spark cluster concurrently.
I have looked into Spark schedulers and dynamic resources for Spark and Mesos, but nothing seems to work.
Is it even possible or should I change the approach?
Solved it myself using dynamic allocation.
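For reference, a sketch of what enabling dynamic allocation can look like (the app name, bounds and timeout are placeholders; on Mesos the external shuffle service also has to be running on the agents):

from pyspark.sql import SparkSession

# Placeholder bounds; the point is that idle executors are released back to Mesos.
spark = (SparkSession.builder
         .appName("django-site-1")                               # placeholder app name
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")        # external shuffle service on the agents
         .config("spark.dynamicAllocation.minExecutors", "0")
         .config("spark.dynamicAllocation.maxExecutors", "4")
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .getOrCreate())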

Get Spark session on executor

After deploying a Spark Structured Streaming application, how can I obtain a Spark session on the executor in order to deploy another job with the same session and the same configuration settings?
You cannot get the Spark session onto an executor if you are running Spark in cluster mode, because the Spark session object cannot be serialised and therefore cannot be sent to the executor. It is also against Spark's design principles to do so.
I may be able to help you with this if you can tell me the problem statement.
Technically you can get a Spark session on the executor, no matter which mode you are running in, but it is not really worth the effort. A Spark session is an object holding various internal Spark settings, along with the other user-defined settings provided on startup.
The only reason those configuration settings are not available on the executor is that most of them are marked as transient, which means they are sent as null; it does not make logical sense to send them to the executors, in the same way it does not make sense to send database connection objects from one node to another.
One cumbersome way to do this would be to get all the configuration settings from the Spark session in your driver, set them in some custom object marked as serializable, and send that object to the executor. Your executor environment would also have to match the driver in terms of all Spark jars/directories and other Spark properties such as SPARK_HOME, which can be hectic if you only realize at runtime, every time, that something is missing. It would be a different Spark session object, but with all the same settings.
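A rough sketch of that cumbersome approach, with made-up names, showing only the serializable part (capturing the driver's settings in a plain dict and shipping it inside the task closure); actually rebuilding a full session from it on the executor is the step the answer warns about:

conf_dict = dict(spark.sparkContext.getConf().getAll())   # a plain dict is safely picklable

def use_conf(partition):
    # Runs on an executor; the closure carries conf_dict along with it.
    # From a dict like this you could, in principle, rebuild a SparkSession
    # with the same settings, provided the executor machine has the same
    # Spark installation as the driver.
    app_name = conf_dict.get("spark.app.name", "unknown")
    return [(app_name, x) for x in partition]

print(spark.sparkContext.parallelize(range(4), 2).mapPartitions(use_conf).collect())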
The better option would be to run another Spark application with the same settings you provide for your other application, since one Spark session is associated with one Spark application.
It is not possible. I also had a similar requirement; I had to create two separate main classes and one Spark launcher class, in which I called sparksession.conf.set(main class name) depending on which class I wanted to run. If I wanted to run both, I used thread.sleep() to let the first one complete before launching the other. I also used SparkListener code to get the status of whether it had completed or not.
I am aware that this is a late response. Just thought this might be useful.
So, you can use something like the code snippet below in your Spark Structured Streaming application.
For Spark versions <= 3.2.1:
spark_session_for_this_micro_batch = microBatchOutputDF._jdf.sparkSession()
For Spark versions >= 3.3.1:
spark_session_for_this_micro_batch = microBatchOutputDF.sparkSession
Your function can use this spark session to create dataframe there.
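Putting that together, a possible foreachBatch sketch for Spark >= 3.3 (the rate source, summary columns and output path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-with-batch-session").getOrCreate()

def process_micro_batch(micro_batch_df, batch_id):
    # Spark >= 3.3: the micro-batch DataFrame exposes its session directly.
    batch_spark = micro_batch_df.sparkSession
    # The same session (and configuration) can be used for extra batch work here,
    # for example building and writing a per-batch summary DataFrame.
    summary = batch_spark.createDataFrame([(batch_id, micro_batch_df.count())],
                                          ["batch_id", "rows"])
    summary.write.mode("append").parquet("/tmp/batch_summaries")   # placeholder path

query = (spark.readStream.format("rate").load()                    # toy streaming source
         .writeStream.foreachBatch(process_micro_batch)
         .start())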
You can refer to this Medium post and the PySpark documentation.

How does a Spark Application work?

I am trying to implement a simple Spark SQL application that takes a query as input and processes the data. Because I need to cache the data, I have to maintain a single SQLContext object, and I am not able to understand how I can use the same SQLContext while continuing to receive queries from the user.
So how does an application work? When an application is submitted to the cluster, does it keep running on the cluster, or does it perform a specific task and shut down immediately after the task?
A Spark application has a driver program that starts and configures the SparkContext. The driver program can be inside your application, and you can use the same SparkContext throughout the life of your application.
The SparkContext is thread safe, so multiple users can use it to run jobs concurrently.
There is an open-source project, Zeppelin, that does just that.
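A small sketch of that long-running driver pattern (the app name, pool name, view name and queries are illustrative):

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

# One long-running driver; the SparkContext lives for the life of the application.
spark = (SparkSession.builder
         .appName("interactive-sql")                 # placeholder app name
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

spark.range(1_000_000).createOrReplaceTempView("numbers")
spark.sql("CACHE TABLE numbers")                     # data stays cached across user queries

def run_query(sql):
    # The SparkContext is thread safe, so each user query can run in its own thread.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "users")
    return spark.sql(sql).collect()

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_query, "SELECT count(*) FROM numbers"),
               pool.submit(run_query, "SELECT max(id) FROM numbers")]
    print([f.result() for f in futures])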
