I am using Spark SQL for a data migration project.
How should I implement a staging area in Spark?
When should I use Spark SQL cache or persist?
Any real-time use cases?
~Sha
Similarly to RDDs (see: What is the difference between cache and persist?), the only difference between cache and persist is the ability to set a non-default storage level.
There is one important difference, though. Unlike the RDD API, where cache uses MEMORY_ONLY, the Dataset counterpart uses MEMORY_AND_DISK.
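A minimal sketch of the two options for a staging step (the parquet path is a placeholder, not from the original question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("staging-example").getOrCreate()

// Hypothetical staging table used during the migration (path is an assumption).
val staged = spark.read.parquet("/tmp/staging/source_table")

// Option 1: cache() -- on Datasets/DataFrames this uses MEMORY_AND_DISK by default.
staged.cache()

// Option 2: persist() -- identical to cache() except you can pick the storage level,
// e.g. keep partitions only in memory and recompute them if they don't fit:
// staged.persist(StorageLevel.MEMORY_ONLY)

staged.count()      // materializes the cache
staged.unpersist()  // release it once the staging step is done
```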
If I execute a filter on an IgniteRDD, is the filter pushed down to the Ignite server, or does the Spark RDD first collect all the data and then execute the filter within the Spark application?
There is no collect at all, but as far as I know there is a distinction between two cases:
A plain filter will use standard Spark execution.
sql will be processed by Ignite itself, without involving Spark.
It all depends on the Catalyst Optimizer. You can check the plans to understand your pipeline and see where it is executed. Debugging might also help.
As explained here, IgniteRDD is an implementation of a Spark RDD that represents an Ignite cache and exposes the Spark API. As the example there shows, filter operates on the cache directly.
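A rough sketch of the two cases. The config path and cache name are assumptions, and the SQL table/column names assume the cache is configured with (Int, String) indexed types so the SQL engine exposes a String table; toDebugString/explain let you inspect where each plan executes:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ignite-filter-vs-sql").getOrCreate()
val sc = spark.sparkContext

// Path and cache name are placeholders for this sketch.
val ic = new IgniteContext(sc, "config/example-shared.xml")
val igniteRdd = ic.fromCache[Int, String]("partitioned")

// Case 1: a plain filter -- executed by Spark over the cache contents.
val filtered = igniteRdd.filter(_._2.contains("error"))
println(filtered.toDebugString)

// Case 2: sql -- the query is processed by Ignite itself.
val df = igniteRdd.sql("select _val from String where _val like '%error%'")
df.explain()
```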
I would like to ask whether Ignite is suitable for my use case, which is:
Load all the data from the Oracle tables into the Ignite cache, and then run various SQL queries (aggregation/join/sub-query) against the data in the cache.
When Oracle has newly created or updated data, there should be some way to insert that data into the cache or update the corresponding entries in the cache.
When the cache goes down, there should be some way to restore the data from Oracle.
I'm not sure whether Ignite SQL Grid fits this use case.
Also, I notice that IgniteRDD is not immutable; is IgniteRDD suitable for this use case? That is, I would first load the Oracle data into an IgniteRDD and then apply the corresponding changes to the IgniteRDD as data is created or updated in Oracle. But it looks like IgniteRDD doesn't support complicated SQL (aggregation/join/sub-query)?
This is one of the basic use cases supported by Ignite.
Data can be pre-loaded from Oracle using one of the methods covered in this documentation section.
If you're planning to update the data in Ignite first and propagate it to Oracle afterwards (which is the preferred way), then it makes sense to use Oracle as a CacheStore in write-through/read-through mode. Ignite will make sure to keep the data in sync with the persistent layer. Moreover, it will be straightforward to pre-load data from Oracle if the cluster is restarted.
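A minimal sketch of the read-through/write-through wiring. The cache name, key/value types and the store class are assumptions; a real store would run JDBC statements against Oracle (in practice the Web Console can generate this mapping for you):

```scala
import javax.cache.Cache
import javax.cache.configuration.FactoryBuilder
import org.apache.ignite.Ignition
import org.apache.ignite.cache.store.CacheStoreAdapter
import org.apache.ignite.configuration.CacheConfiguration

// Hypothetical store: the real implementation would execute JDBC against Oracle.
class OraclePersonStore extends CacheStoreAdapter[java.lang.Long, String] {
  override def load(key: java.lang.Long): String = {
    null // e.g. SELECT name FROM persons WHERE id = ?   (read-through on a cache miss)
  }
  override def write(e: Cache.Entry[_ <: java.lang.Long, _ <: String]): Unit = {
    ()   // e.g. MERGE INTO persons ...                   (write-through on cache put)
  }
  override def delete(key: AnyRef): Unit = {
    ()   // e.g. DELETE FROM persons WHERE id = ?         (write-through on cache remove)
  }
}

// Cache name and types are assumptions for this sketch.
val cacheCfg = new CacheConfiguration[java.lang.Long, String]("PersonCache")
cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(classOf[OraclePersonStore]))
cacheCfg.setReadThrough(true)   // misses are loaded from Oracle
cacheCfg.setWriteThrough(true)  // updates made through Ignite are persisted to Oracle

val ignite = Ignition.start()
val cache  = ignite.getOrCreateCache(cacheCfg)

// Re-populate the cache from Oracle, e.g. after a cluster restart
// (delegates to the store's loadCache, not overridden in this stub).
cache.loadCache(null)
```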
Finally, you can take advantage of GridGain Web Console by connecting to Oracle and mapping Oracle's schema to Ignite cache configurations and POJO objects.
As mentioned, it's recommended to make all the updates through Ignite first, which will persist them to Oracle. But if Oracle is updated by other applications that are not aware of Ignite, you need to update the Ignite cluster yourself. Ignite doesn't have any feature that covers this use case. However, it can easily be implemented with GridGain, which is built on top of Ignite, using its Oracle GoldenGate integration.
Once the data is in the Ignite cluster, use SQL Grid to query and/or update it. The SQL Grid engine is ANSI-99 compliant and doesn't have any limitations.
As for the Ignite Shared RDD, it stores data in a distributed Ignite cache. This is why it is mutable, in contrast to native Spark RDDs. A Shared RDD's SQL capabilities are exactly the same; it's just one more API on top of SQL Grid.
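For the querying side, a small sketch against the hypothetical "PersonCache" above; SQL visibility additionally requires the cache to declare query entities or indexed types, so the table name here is an assumption:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import scala.collection.JavaConverters._

// Reuses the hypothetical cache from the sketch above.
val cache = Ignition.ignite().cache[java.lang.Long, String]("PersonCache")

val qry = new SqlFieldsQuery(
  "select _key, _val from String where _val like ?").setArgs("%Smith%")

cache.query(qry).getAll.asScala.foreach(row => println(row.asScala.mkString(", ")))
```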
I have a Spark application which caches RDDs at runtime, based on the datasets, and performs operations on them.
For monitoring purposes I want to find out the number of RDDs in the cache while the application is running. Does Spark provide any API to find out these details?
It is possible to use the Spark REST API, which provides two endpoints:
/applications/[app-id]/storage/rdd - a list of all stored RDDs.
/applications/[app-id]/storage/rdd/[rdd-id] - detailed information about a particular RDD.
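A quick sketch of hitting those endpoints from outside the application; the driver host, application id and RDD id below are placeholders (the UI normally listens on port 4040):

```scala
import scala.io.Source

// Placeholders: replace with your driver host, UI port and application id.
val base  = "http://driver-host:4040/api/v1"
val appId = "app-20240101123456-0000"

// All RDDs currently stored (cached) by the application, as JSON.
val allRdds = Source.fromURL(s"$base/applications/$appId/storage/rdd").mkString
println(allRdds)

// Details for one particular RDD (id 3 here is a placeholder),
// e.g. its storage level and number of cached partitions.
val oneRdd = Source.fromURL(s"$base/applications/$appId/storage/rdd/3").mkString
println(oneRdd)
```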
I would like to share data between a Spark executor and a C++ process. Apart from storing the data as a file in an in-memory FS like Tachyon/IgniteFS, is there any other efficient method?
Ignite gives Spark the ability to store the results of its executions and share them between different Spark jobs in a shared RDD called IgniteRDD.
In a nutshell, IgniteRDD is a distributed named cache that can be accessed directly using basic cache.get-like operations. This means that if you use Ignite C++ you can interact with the cache that backs the IgniteRDD using the basic cache API.
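A sketch of the Spark side, assuming an Ignite config at config/example-shared.xml and a cache named sharedData (both placeholders); a C++ client pointed at the same cluster could then read the same keys through its own cache get API:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("share-with-cpp").getOrCreate()
val sc = spark.sparkContext

// Config path and cache name are placeholders for this sketch.
val ic = new IgniteContext(sc, "config/example-shared.xml")
val shared = ic.fromCache[String, Double]("sharedData")

// Publish results of a Spark computation into the shared Ignite cache.
val results = sc.parallelize(Seq("metric-a" -> 1.0, "metric-b" -> 2.5))
shared.savePairs(results)

// The C++ process can now read the same entries (e.g. a get on "metric-a")
// using the Ignite C++ client against the same cluster.
```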
What is the best way to share Spark RDD data between two Spark jobs?
I have a case where job 1, a Spark sliding-window streaming app, consumes data at regular intervals and creates an RDD. We do not want to persist this to storage.
Job 2: a query job that accesses the same RDD created in job 1 and generates reports.
I have seen a few answers suggesting Spark Job Server, but since it is an open-source project I'm not sure whether it is a viable option; any pointers would be of great help.
Thank you!
The short answer is that you can't share RDDs between jobs. The only way you can share data is to write it to HDFS and then pull it in from the other job. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which allows very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
You can share RDDs across different applications using Apache Ignite.
Apache Ignite provides an abstraction for sharing RDDs, through which applications can access RDDs corresponding to different applications. In addition, Ignite has support for SQL indexes, whereas native Spark doesn't.
Please refer https://ignite.apache.org/features/igniterdd.html for more details.
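A rough sketch of the producer/consumer pattern with a shared IgniteRDD; the config path and cache name are assumptions, and both halves are shown in one snippet for brevity, whereas in practice each half runs in its own Spark application:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

// ----- Job 1 (streaming/producer): write each window's results into the shared cache.
val spark1 = SparkSession.builder().appName("producer-job").getOrCreate()
val ic1 = new IgniteContext(spark1.sparkContext, "config/example-shared.xml")
val sharedWrite = ic1.fromCache[Long, String]("windowResults")
sharedWrite.savePairs(
  spark1.sparkContext.parallelize(Seq(1L -> "event-1", 2L -> "event-2")))

// ----- Job 2 (reporting/consumer): a separate application sees the same cache.
val spark2 = SparkSession.builder().appName("report-job").getOrCreate()
val ic2 = new IgniteContext(spark2.sparkContext, "config/example-shared.xml")
val sharedRead = ic2.fromCache[Long, String]("windowResults")
println(sharedRead.count())
```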
According to the official documentation:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can also save the data to a global temporary view. The view is registered in the global_temp database and is visible to other Spark sessions within the same application; it lives until the application terminates.
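A minimal sketch within a single Spark application (the view name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("global-temp-view-example").getOrCreate()

// Hypothetical data produced by one part of the application.
val df = spark.range(5).toDF("id")
df.createGlobalTempView("shared_ids")

// Another session inside the same application can query it via the global_temp database.
val otherSession = spark.newSession()
otherSession.sql("SELECT * FROM global_temp.shared_ids").show()
```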