Writable Shared Memory in Apache Spark - apache-spark

I am working on a project of Twitter Data Analysis using Apache Spark with Java and Cassandra for NoSQL databases.
In the project I am working I want to maintain a arraylist of linkedlist(will use Java in built Arraylist and Linkedlist) which is common to all mapper nodes. I mean, if one mapper writes some data into arraylist it should be reflected to all other mapper nodes.
I am aware of broadcast shared variable, but that is read only shared variable, what I want is shared writable dataframe where changes by one mapper should be reflected in all.
Any advice on how to achieve this in apache spark with Java will be of great help.
Thanks in advance

Short, and most likely disappointing, answer is it is not possible given Spark architecture. Worker nodes don't communicate with each other and neither broadcast variables nor accumulators (write-only variables) are really shared variables. You can try different workarounds like using external services or shared file system to communicate but it introduces all kind of issues like idempotency or synchronizing.
As far as I can tell the best thing you can get is updating state between batches or using tools like StreamingContext.remember.

Related

Can Spark executors be spawned in already running java process (Ignite JVM)

I am working on a project where I need to share execution state across different spark application.
I decided to go with apache-ignite as a shared memory storage between different spark application.
I was thinking of going with embedded ignite mode with static allocation in spark where
ignite nodes will start in Spark executor process. So that, tasks will be executed in the same process where Data is present. But, this mode is deprecated.
I could go with standalone Ignite deployment but there would be inter-process communication to get and save the state which I want to avoid.
Is there any way to tell the Spark to create its executors in already present process (in this case, Ignite nodesprocesses) ?
Can ExternalClusterManager be implemented to achieve this ?
Does Ignite is planning to introduce such mode in future ?
Well, yes, your general direction is reasonable. Ignite's deprecated embedded deployment is, so to say, embedded "backwards" - when you embed Ignite into Spark it works poorly, but if we embedded Spark into Ignite, it would work better.
Yes, I assume it would be possible to implement. It probably could even be implemented outside of Ignite.
I don't think there are any open issues for that in Ignite backlog, but you can share you suggestions on Ignite dev mailing list.
And now the main part. All you're going to achieve with your suggestion is replacing inter-process communication with intra-process. Usually, communication on the same host isn't that expensive. You might see some performance gain from this but I'd only went into implementing this if there were a solid evidence that this is going to solve a real problem.

How can I make Spark DataSet streamed to memory accessible to another spark application?

I have a Java application that acts as driver application for Spark. It does some data processing and streams a subset of data to memory.
Sample Code:
ds.writeStream()
.format("memory")
.queryName("orderdataDS")
.start();
Now I need another python application to access this dataset(orderdataDS).
How can this be accomplished?
You cannot, unless both applications share the same JVM driver process (like Zeppelin). If you want data to be shared between multiple applications, please use independent store, like RDBMS.
Overall memory sink is not intended for production:
This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory
To build upon the above answer, Spark was not built with concurrency in mind. Like what the answerer suggests, you need to back Spark with a "state store" like a RDBMS. There are a large number of options when you go to do this. I've detailed the majority of them here

Drawbacks of using embedded Spark in Application

I have a use case where in I launch local spark (embedded) inside an application server rather than going for spark rest job server or kernel. Because former(embedded spark) has very low latency compared to other. I am interested in
Drawbacks of this approach if there are any.
Can same be used in production?
P.S. Low latency is priority here.
EDIT: Size of the data being processed for most of the cases will be less than 100mb.
I don't think it is a drawback at all. If you have a look at the implementation of the Hive Thriftserver within the Spark project itself, they also manage SQLContext etc, in the Hive Server process. This is especially the case, if the amount of data is small and the driver can handle it easily. So I would also see this as a hint, that this okay for production use.
But I totally agree, the documentation or advice in general how to integrate spark into interactive customer-facing application is lacking behind the information for BigData pipelines.

Is it possible to access the Apache IgnitieRDD from C/C++ Application?

I would like to share data between Spark Executor and C++ process. Apart from storing the data as file in in-memory FS like Tachyon/IgniteFS, is there any other efficient method?
Ignite provides Spark with the ability to store results of its executions and share them between different spark jobs in a shared RDD call IgniteRDD.
In the nutshell IgniteRDD is a distributed named cache that can be accessed directly using basic cache.get like operations. It means that if you use Ignite C++ you can interact with such a cache, that is used by IgniteRDD as well, using basic cache API.

How to share data from Spark RDD between two applications

What is the best way to share spark RDD data between two spark jobs.
I have a case where job 1: Spark Sliding window Streaming App, will be consuming data at regular intervals and creating RDD. This we do not want to persist to storage.
Job 2: Query job that will access the same RDD created in job 1 and generate reports.
I have seen few queries where they were suggesting SPARK Job Server, but as it is a open source not sure if it a possible solution, but any pointers will be of great help.
thankyou !
The short answer is you can't share RDD's between jobs. The only way you can share data is to write that data to HDFS and then pull it within the other job. If speed is an issue and you want to maintain a constant stream of data you can use HBase which will allow for very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
You can share RDDs across different applications using Apache Ignite.
Apache ignite provides an abstraction to share the RDDs through which applications can access the RDDs corresponding to different applications. In addition Ignite has the support for SQL indexes, where as native Spark doesn't.
Please refer https://ignite.apache.org/features/igniterdd.html for more details.
According to the official document describes:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save to a temporary view. Table will be available to other sessions until the one that creates it is closed

Resources