How many RDDs are in cache in Spark - apache-spark

I have a Spark application which caches RDDs at runtime based on the datasets and performs operations on them.
For monitoring purposes I want to find out the number of RDDs in cache while the application is running. Does Spark provide any API to find out these details?

It is possible to use the Spark REST API, which provides two endpoints:
/applications/[app-id]/storage/rdd - a list of all stored RDDs.
/applications/[app-id]/storage/rdd/[rdd-id] - detailed information about a particular RDD.
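A minimal sketch of calling these endpoints from Scala, assuming the driver's UI is reachable on the default port 4040; the application id and RDD id below are hypothetical placeholders:

```scala
import scala.io.Source

object CachedRddMonitor {
  def main(args: Array[String]): Unit = {
    // Assumption: the driver UI runs on localhost:4040; adjust host/port for your deployment.
    val appId = "app-20240101000000-0000" // hypothetical application id
    val base  = s"http://localhost:4040/api/v1/applications/$appId/storage/rdd"

    // Returns a JSON array with one entry per cached RDD; its length is the number of RDDs in cache.
    val allCached = Source.fromURL(base).mkString
    println(allCached)

    // Detail for a single RDD (id 0 here is just an example).
    val oneRdd = Source.fromURL(s"$base/0").mkString
    println(oneRdd)
  }
}
```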

Related

Spark Application as a REST Service

I have a question regarding a specific Spark application usage.
We want our Spark application to run as a REST API server, like a Spring Boot application, so it will not be a batch process; instead we will load the application and keep it alive (no call to spark.close()) and use it as a real-time query engine via some API that we will define. I am targeting deployment to Databricks.
I have checked Apache Livy, but I am not sure whether it will be a good option or not.
Any suggestions will be helpful.
Spark isn't designed to run like this; it has no built-in REST API server frameworks other than the History Server and worker UI.
If you wanted a long-running Spark action, you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc., rather than HTTP methods.
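A minimal sketch of that streaming idea, assuming incoming "requests" arrive as lines on a raw socket at localhost:9999; the host, port, and word-count logic are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketDrivenJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("socket-driven-job")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Each line received on the socket is treated as a "request" to the long-running job.
    val requests = ssc.socketTextStream("localhost", 9999)

    // Placeholder action: count words per batch and print the result to the driver log.
    requests.flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination() // keeps the application alive
  }
}
```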
Good question, let's discuss it step by step.
You can create it and it works fine; the following is an example:
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you are thinking of creating a Dataset or DataFrame, keeping it in memory, and using it as a cache (Redis, GemFire, etc.), but here is the catch:
i) If you have data in the range of a few hundred thousand rows, then you really do not need Apache Spark's power; a plain Java app is good enough to return responses really fast.
ii) If you have data in the petabyte range, then loading it into memory as a Dataset or DataFrame will not help, because Apache Spark does not support indexing; Spark is not a data management system but a fast batch data processing engine, whereas with GemFire you have the flexibility to add indexes for fast retrieval of data.
Workaround (see the sketch after this list):
Using Apache Ignite's (https://ignite.apache.org/) in-memory indexes (refer to "Fast Apache Spark SQL Queries")
Using data formats that support indexing, like ORC, Parquet, etc.
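A small sketch of the second workaround, assuming a hypothetical events dataset written as Parquet partitioned by a frequently filtered column so that reads can prune files:

```scala
import org.apache.spark.sql.SparkSession

object ParquetLayoutExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-layout").getOrCreate()
    import spark.implicits._

    // Hypothetical dataset: (id, country, amount)
    val events = Seq((1L, "US", 10.0), (2L, "DE", 20.0), (3L, "US", 30.0))
      .toDF("id", "country", "amount")

    // Partitioning by a frequently filtered column lets Spark skip irrelevant files.
    events.write.mode("overwrite").partitionBy("country").parquet("/tmp/events")

    // The filter on the partition column is resolved by partition pruning, not a full scan.
    val us = spark.read.parquet("/tmp/events").filter($"country" === "US")
    us.show()

    spark.stop()
  }
}
```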
So why not use a Spring application with Apache Spark without calling spark.close()?
A Spring application as a microservice needs other services, either in containers or on PCF/Bluemix/AWS/Azure/GCP, etc., while Apache Spark has its own world and needs compute power which is not available on PCF.
Spark is not a database, so it cannot "store data". It processes data and stores it temporarily in memory, but that's not persistent storage.
Once a Spark job is submitted you have to wait for the results; you cannot fetch data in between.
How to use Spark with a Spring application as a REST API call:
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/
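A rough sketch of submitting a pre-built Spark job to Livy's /batches endpoint over plain HTTP from the Spring side; the Livy host, jar path, and class name below are assumptions:

```scala
import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object LivyBatchSubmit {
  def main(args: Array[String]): Unit = {
    // Assumption: Livy is reachable at this address (8998 is Livy's default port).
    val url  = new URL("http://livy-host:8998/batches")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)

    // Hypothetical jar and entry class for the Spark job being submitted.
    val payload =
      """{"file": "hdfs:///jobs/reports.jar", "className": "com.example.ReportJob"}"""

    val writer = new OutputStreamWriter(conn.getOutputStream)
    writer.write(payload)
    writer.close()

    // Livy responds with JSON describing the created batch session (id, state, ...).
    println(Source.fromInputStream(conn.getInputStream).mkString)
  }
}
```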

How to decide when to use Spark SQL cache or persist?

I am using Spark SQL for a data migration project.
So how should I implement a staging area in Spark?
When should I use Spark SQL cache or persist?
Any real-time use cases?
~Sha
Similarly to RDDs (see: What is the difference between cache and persist?), the only difference between cache and persist is the ability to set a non-default storage level.
There is one important difference though. Unlike in the RDD API, where cache uses MEMORY_ONLY, the Dataset counterpart uses MEMORY_AND_DISK.
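A short illustration of that difference on a Dataset, using a small made-up DataFrame; the only thing persist adds over cache is the explicit storage level:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-vs-persist").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value") // hypothetical data

    // cache() on a Dataset is shorthand for persist(StorageLevel.MEMORY_AND_DISK).
    df.cache()

    // persist() lets you pick a non-default level, e.g. the RDD-style MEMORY_ONLY.
    val memOnly = Seq((3, "c")).toDF("id", "value").persist(StorageLevel.MEMORY_ONLY)

    df.count()      // first action materializes the cached data
    memOnly.count()

    spark.stop()
  }
}
```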

Does filtering an IgniteRDD happen locally in the Spark application or on the Ignite server?

If I execute a filter on an IgniteRDD, is the filter pushed down to the Ignite server, or does the Spark RDD first collect all the data and then execute the filter within the Spark application?
There is no collect at all, but as far as I know there is a distinction between two cases:
A plain filter will use standard Spark execution.
sql will be processed by Ignite itself without involving Spark.
It all depends on the Catalyst optimizer. You can check the plans to understand your pipeline and see where it is executed. Debugging might also help.
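A rough sketch of the two paths, assuming an Ignite cache named "partitioned" with integer keys and values, an indexed Integer value type, and a Spring XML config at the path shown (all of these names are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.ignite.spark.IgniteContext

object IgniteFilterVsSql {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-filter-vs-sql"))

    // Assumption: this Ignite Spring config and cache name exist in your cluster.
    val ic  = new IgniteContext(sc, "config/example-shared.xml")
    val rdd = ic.fromCache[Int, Int]("partitioned")

    // Plain RDD filter: data flows through standard Spark execution on the executors.
    val viaSpark = rdd.filter { case (_, value) => value % 2 == 0 }.count()

    // sql(): the query is executed by Ignite itself, using its indexes where available.
    val viaIgnite = rdd.sql("select _val from Integer where _val > ?", 10)
    viaIgnite.show()

    println(s"Matches via Spark filter: $viaSpark")
    sc.stop()
  }
}
```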
As explained here, IgniteRDD is an implementation of a Spark RDD that represents an Ignite cache and uses the Spark API. As the example there shows, filter would operate on the cache directly.

How to share data from Spark RDD between two applications

What is the best way to share Spark RDD data between two Spark jobs?
I have a case where job 1, a Spark sliding-window streaming app, will be consuming data at regular intervals and creating RDDs. We do not want to persist these to storage.
Job 2 is a query job that will access the same RDDs created in job 1 and generate reports.
I have seen a few posts suggesting Spark Job Server, but since it is an open-source project I am not sure whether it is a viable solution; any pointers will be of great help.
Thank you!
The short answer is you can't share RDDs between jobs. The only way you can share data is to write that data to HDFS and then pull it in from the other job. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which will allow for very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
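Since the streaming job's RDDs cannot be handed to the query job directly, one hedged sketch of this answer is to land each window's RDD in a shared HDFS location that the report job can read later; the socket source, window sizes, and paths are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SlidingWindowWriter {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("window-writer"), Seconds(10))

    // Hypothetical source; each windowed RDD is written out instead of being shared in memory.
    ssc.socketTextStream("localhost", 9999)
      .window(Seconds(60), Seconds(30))
      .foreachRDD { (rdd, time) =>
        if (!rdd.isEmpty()) {
          // Job 2 can list and read these directories to build its reports.
          rdd.saveAsTextFile(s"hdfs:///shared/windows/batch-${time.milliseconds}")
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```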
You can share RDDs across different applications using Apache Ignite.
Apache Ignite provides an abstraction for sharing RDDs, through which applications can access RDDs that belong to other applications. In addition, Ignite has support for SQL indexes, whereas native Spark doesn't.
Please refer to https://ignite.apache.org/features/igniterdd.html for more details.
The official documentation describes it as follows:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save the data to a global temporary view. The view will be available to other sessions of the same Spark application until that application is closed.
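A small sketch of that approach with a global temporary view, which other sessions of the same Spark application can query under the global_temp database; the view name and data are made up:

```scala
import org.apache.spark.sql.SparkSession

object SharedTempView {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-view").getOrCreate()
    import spark.implicits._

    // Session A registers the data as a global temporary view.
    Seq((1, "a"), (2, "b")).toDF("id", "value").createOrReplaceGlobalTempView("shared_data")

    // Session B (another session of the same application) reads it via the global_temp database.
    val otherSession = spark.newSession()
    otherSession.sql("SELECT * FROM global_temp.shared_data").show()

    spark.stop()
  }
}
```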

How to share Spark RDD between 2 Spark contexts?

I have an RMI cluster. Each RMI server has a Spark context.
Is there any way to share an RDD between different Spark contexts?
As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context which was used to create it (SparkContext in the case of an RDD, SQLContext in the case of a DataFrame or Dataset). If you want to share objects between applications you have to use shared contexts (see for example spark-jobserver, Livy, or Apache Zeppelin). Since an RDD or DataFrame is just a small local object, there is really not much to share.
Sharing data is a completely different problem. You can use a specialized in-memory cache (Apache Ignite) or a distributed in-memory file system (like Alluxio, formerly Tachyon) to minimize the latency when switching between applications, but you cannot really avoid it.
No, an RDD is tied to a single SparkContext. The general idea is that you have a Spark cluster and one driver program that tells the cluster what to do. This driver would have the SparkContext and kick off operations on the RDDs.
If you want to just move an RDD from one driver program to another, the solution is to write it to disk (S3/HDFS/...) in the first driver and load it from disk in the other driver.
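A minimal sketch of that handoff with a made-up S3 path: the first driver program writes the data out, and the second driver program (a separate application) reads it back:

```scala
import org.apache.spark.sql.SparkSession

// Driver program 1: materialize the RDD/Dataset to shared storage.
object WriterDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("writer").getOrCreate()
    import spark.implicits._
    val data = Seq((1, "a"), (2, "b")).toDF("id", "value") // stand-in for the real data
    data.write.mode("overwrite").parquet("s3a://my-bucket/shared/data") // hypothetical path
    spark.stop()
  }
}

// Driver program 2: a different application loads the same data from disk.
object ReaderDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reader").getOrCreate()
    val data = spark.read.parquet("s3a://my-bucket/shared/data")
    data.show()
    spark.stop()
  }
}
```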
You can't natively; in my understanding, an RDD is not data, but a way to create data via transformations/filters from the original data.
Another idea is to share the final data instead. So, you would store the RDD's output in a data store, such as:
- HDFS (a Parquet file, etc.)
- Elasticsearch
- Apache Ignite (in-memory)
I think you will love Apache Ignite: https://ignite.apache.org/features/igniterdd.html
Apache Ignite provides an implementation of the Spark RDD abstraction which allows you to easily share state in memory across multiple Spark jobs, either within the same application or between different Spark applications.
IgniteRDD is implemented as a view over a distributed Ignite cache, which may be deployed either within the Spark job executing process, or on a Spark worker, or in its own cluster.
(I let you dig their documentation to find what you are looking for.)
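A rough sketch of that pattern: one application saves pairs into a shared Ignite cache, and a second application reads them back through its own IgniteContext; the cache name, config path, and data are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.ignite.spark.IgniteContext

// Application 1: writes key/value pairs into a shared Ignite cache.
object IgniteProducerApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-producer"))
    val ic = new IgniteContext(sc, "config/example-shared.xml") // hypothetical Ignite config
    val shared = ic.fromCache[Int, Int]("sharedRDD")
    shared.savePairs(sc.parallelize(1 to 1000).map(i => (i, i * i)))
    sc.stop()
  }
}

// Application 2: a separate Spark application reads the same cache.
object IgniteConsumerApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-consumer"))
    val ic = new IgniteContext(sc, "config/example-shared.xml")
    val shared = ic.fromCache[Int, Int]("sharedRDD")
    println(s"Pairs visible from the second application: ${shared.count()}")
    sc.stop()
  }
}
```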
