Is it possible to access the Apache IgniteRDD from a C/C++ application? - apache-spark

I would like to share data between a Spark executor and a C++ process. Apart from storing the data as a file in an in-memory FS like Tachyon/IgniteFS, is there any other efficient method?

Ignite gives Spark the ability to store the results of its executions and share them between different Spark jobs in a shared RDD called IgniteRDD.
In a nutshell, an IgniteRDD is a distributed named cache that can be accessed directly with basic cache.get-style operations. This means that if you use Ignite C++ you can interact with the same cache that backs the IgniteRDD through the basic cache API.
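Ignite's C++ client exposes the same kind of cache Get/Put operations, so a C++ process can read and write the cache backing the IgniteRDD directly; the sketch below shows the pattern with the Java client, where the Spring config file and the cache name "sharedRdd" are assumptions that must match the IgniteContext configuration on the Spark side:
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteRddCacheReader {
    public static void main(String[] args) {
        // Join the cluster as a client node; "ignite-config.xml" is a placeholder
        // and must describe the same cluster the Spark-side IgniteContext uses.
        Ignition.setClientMode(true);
        try (Ignite ignite = Ignition.start("ignite-config.xml")) {
            // "sharedRdd" must match the cache name passed to IgniteContext.fromCache(...)
            IgniteCache<Integer, String> cache = ignite.cache("sharedRdd");
            // Plain key-value reads and writes against the same data the IgniteRDD exposes to Spark.
            System.out.println(cache.get(1));
            cache.put(2, "written outside Spark");
        }
    }
}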

Related

Spark Application as a REST Service

I have a question regarding a specific Spark application usage.
We want our Spark application to run as a REST API server, like a Spring Boot application. It will not be a batch process; instead we will load the application once, keep it alive (no call to spark.close()) and use it as a real-time query engine via an API that we define. I am targeting deployment on Databricks.
I have checked Apache Livy, but I am not sure whether it would be a good option.
Any suggestions will be helpful.
Spark isn't designed to run like this; it has no built-in REST API server framework other than the History Server and the worker UIs.
If you want a long-running Spark application, you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc. rather than HTTP methods; a sketch of the Kafka variant follows.
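A minimal sketch with Structured Streaming (the broker address and topic name are placeholders, and it assumes the spark-sql-kafka-0-10 package is on the classpath):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StreamingQueryEngine {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("long-running-query-engine")
                .getOrCreate();

        // Take requests off a Kafka topic instead of exposing an HTTP endpoint.
        Dataset<Row> requests = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "requests")
                .load();

        // Echo the decoded request payloads; a real app would run a query per request.
        requests.selectExpr("CAST(value AS STRING) AS request")
                .writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}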
Good question; let's discuss this step by step.
You can build this and it works fine; the following is an example:
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you are thinking of creating a Dataset or DataFrame, keeping it in memory and using it as a cache (Redis, GemFire etc.), but here is the catch:
i) If you only have a few hundred thousand records, you don't really need Apache Spark's power; a plain Java app is good enough to return responses really fast.
ii) If you have petabytes of data, loading it into memory as a Dataset or DataFrame will not help, because Apache Spark doesn't support indexing: Spark is not a data management system but a fast batch data processing engine. With GemFire, by contrast, you have the flexibility to add indexes for fast retrieval of data.
Workarounds:
- Use Apache Ignite's (https://ignite.apache.org/) in-memory indexes (refer to "Fast Apache Spark SQL Queries").
- Use data formats that support indexing, like ORC, Parquet etc. (a rough sketch follows below).
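As a rough sketch of the second workaround (the paths and the customerId column are made up), partitioning the Parquet output by the key the API filters on lets Spark prune files instead of scanning everything:
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("partitioned-parquet").getOrCreate();
        // Write once, partitioned by the key the REST API will filter on.
        spark.read().json("/data/orders-raw")
             .write().partitionBy("customerId").parquet("/data/orders");
        // Later lookups only read the matching partition directory, and Parquet
        // row-group statistics give additional pushdown on other columns.
        Dataset<Row> hit = spark.read().parquet("/data/orders")
             .filter(col("customerId").equalTo("42"));
        hit.show();
    }
}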
So why not use a Spring application with Apache Spark without calling spark.close()?
A Spring application runs as a microservice next to other services, either on a container platform or on PCF/Bluemix/AWS/Azure/GCP etc., whereas Apache Spark is its own world and needs compute power that is not available on PCF.
Spark is not a database, so it cannot "store data". It processes data and stores it temporarily in memory, but that's not persistent storage.
Once a Spark job is submitted you have to wait for its results; you cannot fetch data in between.
How to use Spark with a Spring application as a REST API call:
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/
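A rough sketch of calling Livy from Java with the JDK's built-in HttpClient (the host, port and the hard-coded session id are placeholders; in practice the session id comes from the JSON returned when the session is created):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivySubmitSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Ask Livy to start an interactive Spark session.
        HttpRequest createSession = HttpRequest.newBuilder()
                .uri(URI.create("http://livy-host:8998/sessions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"kind\": \"spark\"}"))
                .build();
        System.out.println(http.send(createSession, HttpResponse.BodyHandlers.ofString()).body());

        // Run a code snippet in that session; the session id (0) would normally be
        // parsed out of the response above.
        HttpRequest runStatement = HttpRequest.newBuilder()
                .uri(URI.create("http://livy-host:8998/sessions/0/statements"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"code\": \"spark.range(100).count()\"}"))
                .build();
        System.out.println(http.send(runStatement, HttpResponse.BodyHandlers.ofString()).body());
    }
}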

How can I make a Spark Dataset streamed to memory accessible to another Spark application?

I have a Java application that acts as the driver application for Spark. It does some data processing and streams a subset of the data to memory.
Sample Code:
// Stream the Dataset into Spark's in-memory sink as a table named "orderdataDS"
ds.writeStream()
  .format("memory")
  .queryName("orderdataDS")
  .start();
Now I need another Python application to access this dataset (orderdataDS).
How can this be accomplished?
You cannot, unless both applications share the same JVM driver process (like Zeppelin). If you want data to be shared between multiple applications, use an independent store, like an RDBMS.
Overall, the memory sink is not intended for production:
This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory
To build upon the above answer, Spark was not built with concurrency in mind. As that answer suggests, you need to back Spark with a "state store" such as an RDBMS. There are a large number of options when you go to do this; I've detailed the majority of them here
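One common pattern for that (a sketch, with a made-up JDBC URL, table name and credentials) is to drop the memory sink and use foreachBatch to land every micro-batch in a shared database that the Python application can query on its own:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrdersToJdbcSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("orders-to-jdbc").getOrCreate();

        // "orders" stands in for whatever streaming Dataset the Java driver builds.
        Dataset<Row> orders = spark.readStream().format("rate").load();

        // Instead of the memory sink, append each micro-batch to an RDBMS table
        // that the Python application can read independently.
        orders.writeStream()
              .foreachBatch((Dataset<Row> batch, Long batchId) ->
                      batch.write()
                           .format("jdbc")
                           .option("url", "jdbc:postgresql://db-host:5432/orders")
                           .option("dbtable", "orderdata")
                           .option("user", "spark")
                           .option("password", "secret")
                           .mode("append")
                           .save())
              .start()
              .awaitTermination();
    }
}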

How to share data from Spark RDD between two applications

What is the best way to share Spark RDD data between two Spark jobs?
I have a case where job 1, a Spark sliding-window streaming app, consumes data at regular intervals and creates RDDs. We do not want to persist these to storage.
Job 2 is a query job that accesses the same RDDs created in job 1 and generates reports.
I have seen a few questions suggesting Spark Job Server, but as it is open source I am not sure whether it is a viable solution; any pointers will be of great help.
Thank you!
The short answer is that you can't share RDDs between jobs. The only way to share data is to write it to HDFS and then pull it from the other job. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which allows very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
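A minimal sketch of that handoff (the paths and the status column are made up, and in practice the two halves run in separate applications):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetHandoffSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("parquet-handoff").getOrCreate();
        // Job 1 would end each window by persisting its results.
        Dataset<Row> windowed = spark.read().json("/input/events");
        windowed.write().mode("overwrite").parquet("hdfs:///shared/windowed-events");
        // Job 2, started as a separate application, reads the same path for reporting.
        Dataset<Row> shared = spark.read().parquet("hdfs:///shared/windowed-events");
        shared.groupBy("status").count().show();
    }
}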
You can share RDDs across different applications using Apache Ignite.
Apache Ignite provides an abstraction for sharing RDDs, through which applications can access RDDs created by other applications. In addition, Ignite supports SQL indexes, whereas native Spark doesn't.
Please refer to https://ignite.apache.org/features/igniterdd.html for more details.
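Roughly, using the Java API of the ignite-spark module (the Spring config path and the cache name are assumptions, and both jobs must join the same Ignite cluster), the sharing looks like this:
import java.util.Arrays;
import org.apache.ignite.spark.JavaIgniteContext;
import org.apache.ignite.spark.JavaIgniteRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class IgniteShareSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("writer"));

        // Both applications must point at the same Ignite cluster and cache.
        JavaIgniteContext<Integer, String> ic =
                new JavaIgniteContext<>(sc, "ignite-config.xml");
        JavaIgniteRDD<Integer, String> shared = ic.fromCache("windowedData");

        // Job 1 pushes its results into the cache...
        JavaPairRDD<Integer, String> results = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1, "report-a"), new Tuple2<>(2, "report-b")));
        shared.savePairs(results);

        // ...and job 2 (a different Spark application) calls ic.fromCache("windowedData")
        // to read the same data back and query it like any other pair RDD.
        System.out.println(shared.count());
    }
}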
The official documentation puts it this way:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save to a global temporary view. The view will be available to other sessions of the same application until the application that creates it is stopped.
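A small sketch of the view approach (the path is a placeholder); note that a global temporary view is only visible to other sessions of the same Spark application, not to separate applications:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GlobalTempViewSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("view-sharing").getOrCreate();

        Dataset<Row> windowed = spark.read().parquet("/shared/windowed-events");
        windowed.createOrReplaceGlobalTempView("windowed_events");

        // Another session of the same application can query it via the global_temp database.
        spark.newSession()
             .sql("SELECT COUNT(*) FROM global_temp.windowed_events")
             .show();
    }
}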

Writable Shared Memory in Apache Spark

I am working on a Twitter data analysis project using Apache Spark with Java, with Cassandra as the NoSQL database.
In this project I want to maintain an ArrayList of LinkedLists (using Java's built-in ArrayList and LinkedList) that is common to all mapper nodes. I mean, if one mapper writes some data into the ArrayList, it should be reflected on all other mapper nodes.
I am aware of broadcast shared variables, but those are read-only; what I want is a shared writable structure where changes made by one mapper are reflected in all.
Any advice on how to achieve this in Apache Spark with Java will be of great help.
Thanks in advance
The short, and most likely disappointing, answer is that this is not possible given Spark's architecture. Worker nodes don't communicate with each other, and neither broadcast variables nor accumulators (write-only variables) are really shared variables. You can try workarounds like using an external service or a shared file system to communicate, but that introduces all kinds of issues such as idempotency and synchronization.
As far as I can tell, the best you can get is updating state between batches or using tools like StreamingContext.remember.
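As an illustration of updating state between batches (the socket host/port and checkpoint path are placeholders), a DStream job can keep a per-key running count inside Spark rather than in a shared ArrayList:
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class RunningCountsSketch {
    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext ssc = new JavaStreamingContext(
                new SparkConf().setAppName("running-counts"), Durations.seconds(10));
        ssc.checkpoint("/tmp/checkpoints"); // required for stateful operations

        // A socket source stands in for the tweet stream.
        JavaReceiverInputDStream<String> words = ssc.socketTextStream("localhost", 9999);
        JavaPairDStream<String, Integer> hashtags = words
                .filter(w -> w.startsWith("#"))
                .mapToPair(w -> new Tuple2<>(w, 1));

        // Each batch folds its new values into the previous per-key state.
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateCount =
                (values, state) -> {
                    int sum = state.orElse(0);
                    for (int v : values) sum += v;
                    return Optional.of(sum);
                };
        JavaPairDStream<String, Integer> counts = hashtags.updateStateByKey(updateCount);
        counts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}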

How to share Spark RDD between 2 Spark contexts?

I have an RMI cluster. Each RMI server has a Spark context.
Is there any way to share an RDD between different Spark contexts?
As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context that was used to create it (a SparkContext in the case of an RDD, a SQLContext in the case of a DataFrame or Dataset). If you want to share objects between applications you have to use shared contexts (see for example spark-jobserver, Livy, or Apache Zeppelin). Since an RDD or DataFrame is just a small local object, there is really not much to share.
Sharing data is a completely different problem. You can use a specialized in-memory cache (Apache Ignite) or a distributed in-memory file system (like Alluxio, formerly Tachyon) to minimize the latency when switching between applications, but you cannot really avoid it.
No, an RDD is tied to a single SparkContext. The general idea is that you have a Spark cluster and one driver program that tells the cluster what to do. This driver would have the SparkContext and kick off operations on the RDDs.
If you want to just move an RDD from one driver program to another, the solution is to write it to disk (S3/HDFS/...) in the first driver and load it from disk in the other driver.
You can't natively. In my understanding, an RDD is not data but a way to create data via transformations/filters from the original data.
Another idea is to share the final data instead. So you would store the RDD contents in a data store, such as:
- HDFS (a Parquet file, etc.)
- Elasticsearch
- Apache Ignite (in-memory)
I think you will love Apache Ignite: https://ignite.apache.org/features/igniterdd.html
Apache Ignite provides an implementation of Spark RDD abstraction
which allows to easily share state in memory across multiple Spark
jobs, either within the same application or between different Spark
applications.
IgniteRDD is implemented as a view over a distributed Ignite cache,
which may be deployed either within the Spark job executing process,
or on a Spark worker, or in its own cluster.
(I'll let you dig through their documentation to find what you are looking for.)
