Is it possible to get sparkcontext of an already running spark application? - apache-spark

I am running Spark on Amazon EMR with YARN as the cluster manager. I am trying to write a Python app that starts up and caches data in memory. How can I allow other Python programs to access that cached data? i.e.
I start an app Pcache -> cache data and keep that app running.
Another user can access that same cached data from a different program instance.
My understanding was that it should be possible to get a handle on the already running SparkContext and access that data. Is that possible? Or do I need to set up an API on top of that Spark app to expose the data? Or maybe use something like Spark Job Server or Livy.

It is not possible to share a SparkContext between multiple processes. Your options are to build the API yourself, with one server holding the SparkContext and its clients telling it what to do with it, or to use the Spark Job Server, which is a generic implementation of the same idea.

I think this can help you. :)
classmethod getOrCreate(conf=None)
Get or instantiate a SparkContext and register it as a singleton object.
Parameters: conf – SparkConf (optional)
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.getOrCreate
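For completeness, a minimal, hedged sketch of using getOrCreate. Note that it only returns the SparkContext that already exists inside the same driver process; it does not attach to a context owned by another running application:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Pcache")
# Returns the existing SparkContext if one was already created in this
# process, otherwise creates a new one and registers it as the singleton.
sc = SparkContext.getOrCreate(conf)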

Related

Spark context is created every time a batch job is executed in yarn

I am wondering, is there any way to create the Spark context once in the YARN cluster so that incoming jobs can re-use that context? Context creation takes 20 seconds, sometimes more, in my cluster. I am using PySpark for scripting and Livy to submit jobs.
No, you can't just have a standing SparkContext running in YARN. Another idea is to run in client mode, where the client has its own SparkContext (this is the approach used by tools like Apache Zeppelin and the spark-shell).
An option would be to use Apache Livy. Livy is an additional server in your Yarn cluster that provides an interface for clients that want to run Spark jobs on the cluster. One of Livy's features is that you can
Have long running Spark Contexts that can be used for multiple Spark jobs, by multiple clients
If the client is written in Scala or Java it is possible to use a programmatic API:
LivyClient client = new LivyClientBuilder()....build();
Object result = client.submit(new SparkJob(sparkParameters)).get();
All other clients can use a REST API.
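For the REST route, any language with an HTTP client will do. A hedged Python sketch against Livy's session API (the host, port, and polling logic are assumptions for illustration, not part of the original answer):

import json, time
import requests

livy = "http://livy-host:8998"   # assumed Livy endpoint
headers = {"Content-Type": "application/json"}

# Create a long-running interactive PySpark session on the cluster.
session = requests.post(livy + "/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()
session_url = livy + "/sessions/" + str(session["id"])

# Submit a statement against the already-running context and poll for the result.
stmt = requests.post(session_url + "/statements",
                     data=json.dumps({"code": "spark.range(10).count()"}),
                     headers=headers).json()
stmt_url = session_url + "/statements/" + str(stmt["id"])
while True:
    result = requests.get(stmt_url, headers=headers).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)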

Do I need to/how to clean up Spark Sessions?

I am launching Spark 2.4.6 in a Python Flask web service. I'm running a single Spark Context and I have also enabled FAIR scheduling.
Each time a user makes a request to one of the REST end points I call spark = sparkSession.newSession() and then execute various operations using Spark SQL in this somewhat isolated environment.
My concern is, after 100 or 10,000 or a million requests with an equal number of new sessions, at some point am I going to run into issues? Is there a way to let my SparkContext know that I don't need an old session anymore and that it can be cleared?
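For illustration only, a hedged sketch of the pattern the question describes (the Flask endpoint and query are assumed placeholders); newSession() gives each request an isolated SQL configuration and temp-view namespace on top of the one shared SparkContext:

from flask import Flask, jsonify
from pyspark.sql import SparkSession

app = Flask(__name__)

# One SparkSession (and SparkContext) for the whole service.
base_spark = (SparkSession.builder
              .appName("flask-spark-service")          # assumed app name
              .config("spark.scheduler.mode", "FAIR")
              .getOrCreate())

@app.route("/count")
def count():
    # Isolated session per request: separate SQL conf and temp views,
    # but the same underlying SparkContext and cached data.
    session = base_spark.newSession()
    n = session.range(1000).count()                    # placeholder query
    return jsonify(count=n)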

Get Spark session on executor

After deploying a Spark Structured Streaming application, how can I obtain a Spark session on the executor in order to deploy another job with the same session and the same configuration settings?
You cannot get the Spark session on an executor if you are running Spark in cluster mode, because the session object cannot be serialised and therefore cannot be sent to the executors. It is also against Spark's design principles to do so.
I may be able to help you with this if you can tell me the problem statement.
Technically you can get a Spark session on the executor, no matter which mode you run in, but it is not really worth the effort. A Spark session is an object holding various internal Spark settings along with the user-defined settings provided at startup.
The only reason those configuration settings are not available on the executor is that most of them are marked as transient, which means they are sent as null: it does not make logical sense to ship them to the executors, in the same way that it does not make sense to send database connection objects from one node to another.
One cumbersome way to do this would be to collect all configuration settings from the Spark session in your driver, set them on a custom object marked as serializable, and send that to the executor. Your executor environment would also need to match the driver in terms of all Spark jars/directories and other Spark properties such as SPARK_HOME, which can be hectic if you only discover at run time that something is missing. You would end up with a different Spark session object, but with all the same settings.
The better option would be to run another Spark application with the same settings you provide for the first one, since one Spark session is associated with one Spark application.
It is not possible. I had a similar requirement, so I created two separate main classes plus one Spark launcher class, in which I called sparksession.conf.set(main class name) depending on which class I wanted to run. If I wanted to run both, I used thread.sleep() to let the first one complete before launching the other. I also used a SparkListener to check whether it had completed or not.
I am aware that this is a late response. Just thought this might be useful.
So, you can use something like the below code snippet in your Spark Structured Streaming application:
For Spark versions <= 3.2.1:
spark_session_for_this_micro_batch = microBatchOutputDF._jdf.sparkSession()
For Spark versions >= 3.3.1:
spark_session_for_this_micro_batch = microBatchOutputDF.sparkSession
Your foreachBatch function can use this Spark session to create DataFrames there (see the sketch after the links below).
For more details, refer to the related Medium post and the PySpark documentation.
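Here is a hedged sketch of how that looks inside a foreachBatch function on Spark >= 3.3 (the rate source, checkpoint path, and join are placeholders/assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-app").getOrCreate()

def process_batch(microBatchOutputDF, batch_id):
    # Spark >= 3.3: the session that owns this micro-batch DataFrame.
    batch_spark = microBatchOutputDF.sparkSession
    # Use that session to create another DataFrame under the same configuration.
    lookup_df = batch_spark.createDataFrame([(0, "even"), (1, "odd")], ["parity", "label"])
    enriched = (microBatchOutputDF
                .withColumn("parity", F.col("value") % 2)
                .join(lookup_df, "parity"))
    enriched.show(truncate=False)

stream_df = spark.readStream.format("rate").load()          # built-in test source
query = (stream_df.writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "/tmp/checkpoint")    # placeholder path
         .start())
query.awaitTermination()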

PySpark application creates many pyspark-shell sessions

I have started working on Spark using Python. I'm working on an application that uses the SparkML Linear Regression APIs. When I submit my job in YARN cluster mode, many pyspark-shell apps get created during the execution phase, with yarn as the user. I can see them in the YARN UI. They eventually finish with succeeded status, and the main application I actually submitted then finishes with succeeded status as well. Is this expected behavior? It is interesting to me because I create a singleton sparkSession instance and use it throughout my application, so I don't know why these pyspark-shell sessions/apps get created.
The immediate solution would be to use sparkContext instead of sparkSession, but it would help to see your configuration lines, and how you're creating your sessions, to tell why multiple apps are being created.
We just updated to Spark 2.2 from Spark 1.6, so we have yet to delve seriously into sparkSessions (which are new in 2+).
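As a point of reference, here is a minimal, hedged sketch of the singleton pattern the question describes (the app name and helper function are assumptions, not the poster's code); building the session through getOrCreate ensures only one session is created per driver process:

from pyspark.sql import SparkSession

_spark = None

def get_spark():
    # Return the single SparkSession for this driver process, creating it once.
    # getOrCreate() reuses an existing session instead of starting a new application.
    global _spark
    if _spark is None:
        _spark = (SparkSession.builder
                  .appName("sparkml-linear-regression")  # assumed app name
                  .getOrCreate())
    return _spark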

why Livy or spark-jobserver instead of a simple web framework?

I'm building a RESTful API on top of Apache Spark. Serving the following Python script with spark-submit seems to work fine:
import cherrypy
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myApp').getOrCreate()
sc = spark.sparkContext

class doStuff(object):
    @cherrypy.expose
    def compute(self, user_input):
        # do something spark-y with the user input
        return user_output

cherrypy.quickstart(doStuff())
But googling around I see things like Livy and spark-jobserver. I read these projects' documentation and a couple of tutorials but I still don't fully understand the advantages of Livy or spark-jobserver over a simple script with CherryPy or Flask or any other web framework. Is it about scalability? Context management? What am I missing here? If what I want is a simple RESTful API with not many users, are Livy or spark-jobserver worth the trouble? If so, why?
If you use spark-submit, you must manually upload the JAR file to the cluster and run the command. Everything must be prepared before the run.
If you use Livy or spark-jobserver, you can programmatically upload the file and run the job. You can also add additional applications that connect to the same cluster and upload a jar with the next job.
What's more, Livy and spark-jobserver allow you to use Spark in interactive mode, which is hard to do with spark-submit ;)
I won't comment on using Livy or spark-jobserver specifically, but there are at least three reasons to avoid embedding the Spark context directly in your application:
Security, with the main focus on reducing the exposure of your cluster to the outside world. An attacker who gains control over your application can do anything from getting access to your data to executing arbitrary code on your cluster, if the cluster is not correctly configured.
Stability. Spark is a complex framework and there are many factors which can affect its long-term performance and stability. Decoupling the Spark context from your application lets you handle Spark issues gracefully, without full downtime of your application.
Responsiveness. The user-facing Spark API is mostly (in PySpark, exclusively) synchronous. Using an external service basically solves this problem for you.