How to share data from Spark RDD between two applications - apache-spark

What is the best way to share Spark RDD data between two Spark jobs?
I have a case where Job 1, a Spark sliding-window streaming app, will be consuming data at regular intervals and creating RDDs. We do not want to persist these to storage.
Job 2 is a query job that will access the same RDDs created in Job 1 and generate reports.
I have seen a few questions suggesting Spark Job Server, but as it is an open-source project I am not sure whether it is a viable solution; any pointers will be of great help.
Thank you!

The short answer is you can't share RDDs between jobs. The only way you can share data is to write that data to HDFS and then pull it in from the other job. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which allows for very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
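For illustration, a minimal Scala sketch of that HDFS hand-off; the socket source, paths and window/batch durations below are placeholders, not a prescription:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Job 1: a sliding-window streaming app that writes each window's RDD to a shared HDFS directory
val ssc = new StreamingContext(new SparkConf().setAppName("window-writer"), Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder input source
lines.window(Seconds(60), Seconds(30)).foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) rdd.saveAsTextFile(s"hdfs:///shared/windows/${time.milliseconds}")
}
ssc.start()
ssc.awaitTermination()

// Job 2 (a separate application) simply reads whatever Job 1 has written so far:
// val windows = sc.textFile("hdfs:///shared/windows/*")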

You can share RDDs across different applications using Apache Ignite.
Apache Ignite provides an abstraction for sharing RDDs, through which applications can access RDDs created by other applications. In addition, Ignite supports SQL indexes, whereas native Spark does not.
Please refer https://ignite.apache.org/features/igniterdd.html for more details.
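A minimal Scala sketch of that sharing pattern, assuming the ignite-spark module is on the classpath; the cache name and XML config path are placeholders:
import org.apache.ignite.spark.IgniteContext

// sc is the application's existing SparkContext
val igniteContext = new IgniteContext(sc, "config/ignite-config.xml")
val sharedRDD = igniteContext.fromCache[String, Int]("sharedData")

// Application 1 writes key/value pairs into the Ignite cache:
sharedRDD.savePairs(sc.parallelize(Seq(("a", 1), ("b", 2))))

// Application 2, pointing at the same Ignite cluster, creates its own IgniteContext,
// calls fromCache("sharedData"), and reads (or runs SQL over) the same data.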

As the official documentation describes:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html

You can save the data to a global temporary view. The view will be available to other sessions of the same Spark application until that application ends.
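A quick sketch of that approach within a single Spark application; the DataFrame, view name and session handles are placeholders:
// Session A registers the view:
df.createGlobalTempView("orders")

// Session B of the same application reads it through the global_temp database:
val other = spark.newSession()
other.sql("SELECT * FROM global_temp.orders").show()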

Related

How can I make a Spark Dataset streamed to memory accessible to another Spark application?

I have a Java application that acts as driver application for Spark. It does some data processing and streams a subset of data to memory.
Sample Code:
ds.writeStream()
.format("memory")
.queryName("orderdataDS")
.start();
Now I need another Python application to access this dataset (orderdataDS).
How can this be accomplished?
You cannot, unless both applications share the same JVM driver process (like Zeppelin). If you want data to be shared between multiple applications, please use an independent store, like an RDBMS.
Overall, the memory sink is not intended for production:
This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory
To build upon the above answer, Spark was not built with concurrency in mind. As the answerer suggests, you need to back Spark with a "state store" like an RDBMS. There are a large number of options when you go to do this. I've detailed the majority of them here
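One hedged way to wire that up (not from the original answer, recast in Scala, where ds stands for the streaming Dataset from the question) is to replace the memory sink with a foreachBatch sink that writes each micro-batch to a JDBC store, which any other application, Python included, can then query. The URL, table and credentials are placeholders:
import org.apache.spark.sql.DataFrame

val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
  batch.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/orders")   // placeholder connection
    .option("dbtable", "order_data")
    .option("user", "spark")
    .option("password", "secret")
    .mode("append")
    .save()

ds.writeStream
  .foreachBatch(writeBatch)
  .start()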

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools like [NiFi, Gobblin etc.], I have observed that Databricks is now promoting the use of Spark for data ingestion/on-boarding.
We have a Spark [Scala] based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs later.
Now that we are planning to make our application available to clients, we expect any type and number of files [mainly CSV, JSON, XML etc.] from any data source [FTP, SFTP, any relational or NoSQL database] of huge size [ranging from GB to PB].
Keeping this in mind, we are looking for options that could be used for data on-boarding and data sanity checks before pushing data into HDFS.
Options we are considering, in order of priority:
1) Spark for data ingestion and sanity: As our application is written in Spark and runs on a Spark cluster, we plan to use it for the data ingestion and sanity tasks as well.
We are a bit worried about Spark's support for many data sources/file types/etc. Also, if we try to copy data from, say, an FTP/SFTP source, will all workers write data to HDFS in parallel? Are there any limitations when using it this way? Does Spark maintain any audit trail during such a data copy?
2) NiFi in clustered mode: How good would NiFi be for this purpose? Can it be used for any data source and any size of file? Will it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we try to copy GB - PB of data and perform certain sanity checks on that data before pushing it to HDFS?
3) Gobblin in clustered mode: We would like to hear answers to the same questions as for NiFi.
4) Is there any other good option available for this purpose with less infrastructure/cost involved and better performance?
Any guidance/pointers/comparisons for the above mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing some R&D, and considering the fact that using NiFi or Gobblin would demand more infrastructure cost, I started testing Spark for data on-boarding.
So far I have tried using a Spark job to import data [present at a remote staging area/node] into HDFS, and I was able to do that by mounting the remote location on all my Spark cluster worker nodes. Doing this made that location local to the workers, so the Spark job ran properly and the data was on-boarded to HDFS.
Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra, and so far it is going well. I would suggest the same to others: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost [where cost could be a major constraint], go with a Spark job for data on-boarding.
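A minimal Scala sketch of the on-boarding job described above: read files from the mounted staging location, apply a basic sanity rule, and write to HDFS. The paths, header option and filter condition are placeholders:
import org.apache.spark.sql.functions.col

val raw = spark.read
  .option("header", "true")
  .csv("/mnt/staging/incoming/*.csv")              // remote location mounted on all workers

val sane = raw.filter(col("order_id").isNotNull)   // placeholder sanity/validation rule

sane.write
  .mode("append")
  .parquet("hdfs:///data/onboarded/orders")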

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially given the need for a time-series database.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is: as long as I can use a Cassandra cluster of nodes to serve my app, to store and read data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but summarizing: Cassandra is highly scalable and there are a lot of scenarios where it fits, but CQL syntax has some limitations if your schema is not designed for certain queries.
If you want to make use of your data without restrictions, run analytical workloads over your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has tight integration with Cassandra.
I recommend you check these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
Cassandra is for storing data, whereas Spark is for performing computation on top of it. An analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
Especially with computations, when using the DataStax Cassandra connector, data locality can be exploited. If you need to do some computation which modifies a row (but doesn't really depend on anything else), then that operation is optimized to run locally on each machine in the cluster without any data movement over the network.
The same goes for a lot of other Spark workloads: the actions (functions which modify the data) are done locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. If you don't need to do any operations on the data, you can still use Spark for other purposes, as I mention below.
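A hedged sketch using the DataStax spark-cassandra-connector; the keyspace, table and column names are placeholders, and spark.cassandra.connection.host is assumed to be configured:
import com.datastax.spark.connector._

// Read a Cassandra table as an RDD of tuples; the scan runs local to the data.
val readings = sc.cassandraTable[(String, Long, Double)]("iot", "sensor_readings")
  .select("sensor_id", "ts", "value")

// Aggregate per sensor; only the small result leaves the nodes.
val avgBySensor = readings
  .map { case (sensorId, _, value) => (sensorId, (value, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }

avgBySensor.saveToCassandra("iot", "sensor_averages", SomeColumns("sensor_id", "avg_value"))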
Spark Streaming can be used to ingest or export data from Cassandra (I use it a lot personally). The same data import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10GB of data from Cassandra is less than 20 lines of code, with multi-machine, multi-threaded execution built in and an admin UI where I can see the job progress.
With Spark + Zeppelin, we can visualize Cassandra data and build beautiful UIs with little Spark code, where users can even enter input and see the result as a graph/table etc.
Note: Actually, visualization can be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are very hard to set up and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache Cassandra has features like fast reads and writes, so you can use it with Spark Streaming to write your data directly into Cassandra.
As a use case, consider a video application that uploads video via streaming and stores it directly as a Cassandra blob.

How many RDDs in cache Spark

I have a Spark application which caches RDDs at runtime, based on the datasets, and performs operations on them.
For monitoring purposes I want to find out the number of RDDs in the cache while the application is running. Does Spark provide any API to find out these details?
It is possible to use the Spark REST API, which provides two endpoints:
/applications/[app-id]/storage/rdd - list of all stored RDDs.
/applications/[app-id]/storage/rdd/[rdd-id] - detail information for particular RDD.
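A quick way to query those endpoints from code; the driver host, port and application id below are placeholders (the REST API is normally served alongside the UI on port 4040):
import scala.io.Source

val appId = "app-20240101120000-0001"   // placeholder application id
val rddsJson = Source.fromURL(
  s"http://driver-host:4040/api/v1/applications/$appId/storage/rdd").mkString
println(rddsJson)   // JSON array with one entry per cached RDD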

How to share Spark RDD between 2 Spark contexts?

I have an RMI cluster. Each RMI server has a Spark context.
Is there any way to share an RDD between different Spark contexts?
As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context which was used to create it (SparkContext in the case of an RDD, SQLContext in the case of a DataFrame/Dataset). If you want to share objects between applications you have to use shared contexts (see for example spark-jobserver, Livy, or Apache Zeppelin). Since an RDD or DataFrame is just a small local object, there is really not much to share.
Sharing data is a completely different problem. You can use a specialized in-memory cache (Apache Ignite) or a distributed in-memory file system (like Alluxio, formerly Tachyon) to minimize the latency when switching between applications, but you cannot really avoid it.
No, an RDD is tied to a single SparkContext. The general idea is that you have a Spark cluster and one driver program that tells the cluster what to do. This driver would have the SparkContext and kick off operations on the RDDs.
If you want to just move an RDD from one driver program to another, the solution is to write it to disk (S3/HDFS/...) in the first driver and load it from disk in the other driver.
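A minimal sketch of that hand-off through shared storage; the HDFS path and the element type are placeholders:
// Driver program A writes the RDD out:
rdd.saveAsObjectFile("hdfs:///handoff/my-rdd")

// Driver program B (a different SparkContext) loads it back:
val restored = sc.objectFile[(String, Int)]("hdfs:///handoff/my-rdd")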
You can't natively. In my understanding, an RDD is not data, but a way to create data via transformations/filters from the original data.
Another idea is to share the final data instead. So, you would store the RDD in a data store, such as:
- HDFS (a parquet file etc..)
- Elasticsearch
- Apache Ignite (in-memory)
I think you will love Apache Ignite: https://ignite.apache.org/features/igniterdd.html
Apache Ignite provides an implementation of Spark RDD abstraction
which allows to easily share state in memory across multiple Spark
jobs, either within the same application or between different Spark
applications.
IgniteRDD is implemented as a view over a distributed Ignite cache,
which may be deployed either within the Spark job executing process,
or on a Spark worker, or in its own cluster.
(I'll let you dig through their documentation to find what you are looking for.)
