Programmatically access list of live Spark nodes

I've implemented a custom data layer on Spark in which the Spark nodes persist some data locally and announce that persistence to the Spark master. This works well with the custom code we've written running on each node and on the master, but now I'd like to implement a replication protocol across my cluster. What I'd like to build is this: once the master gets a message from a node saying it has persisted data, the master randomly selects two other nodes and has them persist the same data.
I've been digging through the docs, but I don't see an obvious way to get a list of live nodes from the SparkContext. Am I missing something?

There isn't a public API for doing this. However, you could use the Developer API SparkListener (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener). You can create a custom SparkListener class and add it to the SparkContext as
sc.addSparkListener(yourListener)
The system will call onBlockManagerAdded and onBlockManagerRemoved when a BlockManager gets added or removed, and from the BlockManager's ID I believe you can get the URL of the nodes running the live Spark executors (which run the BlockManagers).
I agree that this is a little hacky. :)
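For illustration, here's a minimal sketch of such a listener in Scala (class and member names are made up, and since SparkListener is a Developer API the event classes can change between Spark versions):

```scala
import scala.collection.concurrent.TrieMap

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded, SparkListenerBlockManagerRemoved}
import org.apache.spark.storage.BlockManagerId

// Tracks the BlockManagers (and therefore the executor hosts) that are currently alive.
class LiveNodeListener extends SparkListener {
  private val liveBlockManagers = TrieMap.empty[String, BlockManagerId]

  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit =
    liveBlockManagers.put(event.blockManagerId.executorId, event.blockManagerId)

  override def onBlockManagerRemoved(event: SparkListenerBlockManagerRemoved): Unit =
    liveBlockManagers.remove(event.blockManagerId.executorId)

  // host:port of every node currently running a BlockManager.
  def liveNodes: Seq[String] =
    liveBlockManagers.values.map(id => s"${id.host}:${id.port}").toSeq
}

// On the driver:
//   val listener = new LiveNodeListener
//   sc.addSparkListener(listener)
```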

Related

Spark job as a web service?

A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. Our company intends to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is, of course, the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark and have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. Just as you would run spark-shell, execute some commands, and then call collect() to bring all the data back to be shown in the local environment, it all runs in a single JVM. The service would request executors on a remote Spark cluster, run the work there, and then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not advisable for large datasets, which is what Spark is meant for. Depending on the data you need (e.g. if you are mostly using Spark SQL), it may be better to query a database directly.
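For a sense of what the Livy route looks like, here is a rough Scala sketch of submitting a batch job through Livy's REST API (the host, port, jar path, and class name are placeholders; only the POST /batches endpoint and its file/className fields come from Livy's documented API):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object LivySubmitSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder Livy endpoint; 8998 is Livy's default port.
    val url = new URL("http://livy-host:8998/batches")
    val payload =
      """{"file": "local:/apps/my-spark-job.jar", "className": "com.example.MyJob"}"""

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(payload.getBytes(StandardCharsets.UTF_8))
    out.close()

    // Livy replies with a JSON description of the new batch (id, state, ...).
    val response = scala.io.Source.fromInputStream(conn.getInputStream).mkString
    println(response)
  }
}
```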

Get Spark session on executor

After deploying a Spark Structured Streaming application, how can I obtain a Spark session on the executor, in order to deploy another job with the same session and the same configuration settings?
You cannot get the Spark session on an executor if you are running Spark in cluster mode, because the SparkSession object cannot be serialized and therefore cannot be sent to the executors. It is also against Spark's design principles to do so.
I may be able to help you with this if you can tell me the problem statement.
Technically you can get a Spark session on the executor, no matter which mode you are running in, but it is not really worth the effort. A Spark session is an object holding various internal Spark settings along with the other user-defined settings we provide on startup.
The only reason those configuration settings are not available on the executor is that most of them are marked as transient, which means they are sent as null: it does not make logical sense to send them to the executors, in the same way it does not make sense to send database connection objects from one node to another.
One cumbersome way to do this would be to get all the configuration settings from your Spark session on the driver, set them in some custom object marked as serializable, and send it to the executor. Your executor environment would also have to match the driver in terms of all the Spark jars/directories and other Spark properties such as SPARK_HOME, which can be hectic when you run and realize each time that you are missing something. It would be a different Spark session object, but with all the same settings.
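A hedged sketch of that conf-snapshot idea in Scala (the case class and the RDD are invented for illustration; note that nothing here rebuilds a SparkSession on the executor, only its settings travel as plain data):

```scala
import org.apache.spark.sql.SparkSession

object ConfSnapshotSketch {
  // Plain, serializable holder for driver-side settings.
  case class SessionConfSnapshot(settings: Map[String, String]) extends Serializable

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("conf-snapshot-sketch").getOrCreate()

    // Snapshot the driver's runtime configuration as plain data.
    val snapshot = SessionConfSnapshot(spark.conf.getAll)

    // The snapshot is captured in the closure and shipped to executors like any other value.
    spark.sparkContext.parallelize(1 to 4).foreachPartition { _ =>
      println(snapshot.settings.getOrElse("spark.sql.shuffle.partitions", "unset"))
    }

    spark.stop()
  }
}
```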
The better option would be to run another Spark application with the same settings you provide for your other application, since one Spark session is associated with one Spark application.
It is not possible. I had a similar requirement: I had to create two separate main classes and one Spark launcher class, in which I set the main class name in the configuration depending on which class I wanted to run. If I wanted to run both, I used Thread.sleep() to let the first one complete before launching the other, and I also used SparkListener code to get the status of whether it had completed or not.
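As a sketch of that launcher-based approach, Spark ships org.apache.spark.launcher.SparkLauncher for exactly this; the jar path, main class, and master below are placeholders:

```scala
import java.util.concurrent.CountDownLatch

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchSecondAppSketch {
  def main(args: Array[String]): Unit = {
    val done = new CountDownLatch(1)

    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/second-app.jar")   // placeholder jar
      .setMainClass("com.example.SecondMain")      // placeholder main class
      .setMaster("yarn")
      .setConf("spark.executor.memory", "2g")      // reuse the settings of the first app
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          if (h.getState.isFinal) done.countDown()
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

    done.await()  // block until the launched application reaches a final state
    println(s"Second application finished in state ${handle.getState}")
  }
}
```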
I am aware that this is a late response. Just thought this might be useful.
So, you can use something like the code snippet below in your Spark Structured Streaming application:
For Spark versions <= 3.2.1:
spark_session_for_this_micro_batch = microBatchOutputDF._jdf.sparkSession()
For Spark versions >= 3.3.1:
spark_session_for_this_micro_batch = microBatchOutputDF.sparkSession
Your function can then use this Spark session to create DataFrames.
You can refer to the relevant Medium post and the PySpark documentation.
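For comparison, in the Scala API the per-batch session is available directly as batchDF.sparkSession; a minimal foreachBatch sketch (the rate source, lookup DataFrame, and output path are stand-ins):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchSketch {
  def processBatch(batchDF: DataFrame, batchId: Long): Unit = {
    // Session scoped to this micro-batch, usable to create other DataFrames.
    val batchSpark = batchDF.sparkSession
    val lookup = batchSpark.range(5).toDF("lookup_id")  // stand-in for a real lookup table
    batchDF.crossJoin(lookup).write.mode("append").parquet("/tmp/foreach-batch-sketch")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreach-batch-sketch").getOrCreate()
    val streamDF = spark.readStream.format("rate").load()  // toy streaming source

    val query = streamDF.writeStream
      .foreachBatch(processBatch _)  // method reference avoids the Scala 2.12 overload ambiguity
      .start()

    query.awaitTermination()
  }
}
```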

PySpark application creates many pyspark-shell sessions

I have started working on Spark using Python. I'm working on an application that uses the Spark ML Linear Regression APIs. When I submit my job in YARN cluster mode, many pyspark-shell apps get created during the execution phase with YARN as the user. I can see them in the YARN UI. They eventually finish with succeeded status, and my main application, which I actually submitted, then finishes with succeeded status as well. Is this expected behavior? This is interesting to me because I create a singleton SparkSession instance and use it throughout my application, so I don't know why pyspark-shell sessions/apps get created.
The immediate solution would be to use a SparkContext instead of a SparkSession, but it would be interesting to see your configuration lines and how you're creating your sessions, to be able to tell why multiple apps are being created.
We just updated to Spark 2.2 from Spark 1.6, so we have yet to delve seriously into SparkSessions (which are new in 2.x).

Storing data in Spark's memory

I have a requirement to keep data in Spark's memory, in table format, even when the SparkContext object dies, so that Tableau can access it.
I have used registerTempTable, but the data gets removed once the SparkContext object dies.
Is it possible to store data like this? If not, what possible ways could I look into to feed data to Tableau without reading it from an HDFS location?
You will need to do one of the below:
Run your Spark application as a long-running application. Spark Streaming usually does that out of the box (when you call StreamingContext.awaitTermination()). I have never tried it myself, but I think YARN and Mesos have support for long-running tasks. As you mentioned, whenever your SparkContext dies, all the data is lost (because all the information is stored in the context). I consider spark-shell a long-running application; that's why most Tableau/Spark demos use it, because the context never dies.
Store it in a data store (HDFS, a database, etc.).
Try to use some distributed in-memory framework/file system like Tachyon - I'm not sure whether it has Tableau connectors, though.
Does Tableau read data from a custom Spark application?
I use Power BI (instead of Tableau), and it queries Spark through a Thrift client, so each time it dies and restarts, I send it a "CACHE TABLE myTable" query through the ODBC/JDBC driver.
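A rough Scala sketch of that pattern, since the Spark Thrift Server speaks the HiveServer2 protocol (host, port, credentials, and table name are placeholders, and the Hive JDBC driver has to be on the classpath):

```scala
import java.sql.DriverManager

object CacheTableViaThrift {
  def main(args: Array[String]): Unit = {
    // The Spark Thrift Server exposes a HiveServer2-compatible endpoint,
    // so the standard Hive JDBC driver can talk to it.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
    try {
      val stmt = conn.createStatement()
      // Re-warm the in-memory copy after the Thrift Server (and its SparkContext) restarts.
      stmt.execute("CACHE TABLE myTable")
    } finally {
      conn.close()
    }
  }
}
```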
I came across a very interesting answer to the question asked above: Tachyon.
http://ampcamp.berkeley.edu/5/exercises/tachyon.html

Is it possible to get the SparkContext of an already running Spark application?

I am running Spark on Amazon EMR with YARN as the cluster manager. I am trying to write a Python app that starts up and caches data in memory. How can I allow other Python programs to access that cached data? i.e.
I start an app, Pcache, which caches data, and keep that app running.
Another user can access that same cached data from a different instance.
My understanding was that it should be possible to get a handle on the already running SparkContext and access that data; is that possible? Or do I need to set up an API on top of that Spark app to access the data, or maybe use something like Spark Job Server or Livy?
It is not possible to share the SparkContext between multiple processes. Indeed, your options are to build the API yourself, with one server holding the SparkContext and its clients telling it what to do with it, or to use Spark Job Server, which is a generic implementation of the same idea.
I think this can help you. :)
classmethod getOrCreate(conf=None)
Get or instantiate a SparkContext and register it as a singleton object.
Parameters: conf – SparkConf (optional)
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.getOrCreate
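One caveat worth noting: getOrCreate only returns an existing context within the same driver process; it does not attach to a SparkContext running in another application. A small Scala sketch of that behaviour (the app name is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GetOrCreateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("get-or-create-sketch")

    // The first call in this JVM creates the context and registers it as a singleton.
    val sc1 = SparkContext.getOrCreate(conf)

    // A later call in the *same* process returns the same instance;
    // a separate application/process cannot reach this context this way.
    val sc2 = SparkContext.getOrCreate()
    println(sc1 eq sc2)  // true

    sc1.stop()
  }
}
```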
