GridGain - Load data from HBase

I have an application that queries HBase and persists that data to an Oracle database. The data is used for report generation. I would like to know whether the same approach is possible when using GridGain in place of the Oracle database. Is it possible to get the HBase data, load it into the GridGain in-memory cache, and use it for generating reports?

I think loading the data into a GridGain cache is the right approach. Once the data is loaded, you will be able to run either Parallel Computations or Distributed SQL Queries directly over the GridGain cache.
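As an illustration only, here is a minimal Java sketch of one way to do the load, assuming a plain HBase client scan feeding a GridGain/Ignite data streamer. The table name "reportData", cache name "reportCache", and the column family/qualifier are hypothetical placeholders for your own schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class HBaseToGridGainLoader {
    public static void main(String[] args) throws Exception {
        // Start (or connect to) a GridGain/Ignite node and make sure the cache exists.
        Ignite ignite = Ignition.start();
        ignite.getOrCreateCache("reportCache");

        Configuration hbaseConf = HBaseConfiguration.create();
        try (Connection hbase = ConnectionFactory.createConnection(hbaseConf);
             Table table = hbase.getTable(TableName.valueOf("reportData"));          // hypothetical table
             ResultScanner scanner = table.getScanner(new Scan());
             // A data streamer is the usual way to bulk-load a cache efficiently.
             IgniteDataStreamer<String, String> streamer = ignite.dataStreamer("reportCache")) {

            for (Result row : scanner) {
                String key = Bytes.toString(row.getRow());
                // "cf" / "value" are placeholder column family and qualifier names.
                String value = Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value")));
                streamer.addData(key, value);
            }
        }
    }
}
```

With the data in the cache, reports can then be produced with distributed SQL queries or compute tasks over the cluster instead of hitting Oracle.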

Related

what is the best way to migrate data from HBase to Cassandra for Janusgraph backend storage?

I am using HBase as backend for Janusgraph. I have to migrate to Cassandra as backend. What is the best way to migrate the old data?
One way to go about it is to read data from HBase and put it into Cassandra using Java code.
Migrating data out of JanusGraph is not well supported, so I would prefer to start from copies of the data that were made before it was ingested into JanusGraph. If that is not an option, your suggestion of using Java code to read from one graph and ingest into the other comes first.
Naturally, you want to parallelize this, because millions of operations on a single thread and process take too long to be practical. Although JanusGraph supports OLAP traversals for reading vertices and edges in parallel, JanusGraph OLAP has its own problems, and you are probably better off segmenting the data using a mixed index in JanusGraph and having each process/thread read the segment assigned to it using an OLTP traversal, as sketched below.
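To make the idea more concrete, here is a rough sketch of such a segmented copy in Java. It assumes the source graph is opened from an HBase-backed configuration and the target from a Cassandra/CQL-backed one; the property key "segmentKey" (indexed with a mixed index), the config file names, and the vertex-only copy (edges would need a second pass) are all simplifying assumptions.

```java
import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class SegmentCopier {
    // Copies one segment of vertices; run one instance per segment in parallel processes/threads.
    public static void copySegment(long lowerBound, long upperBound) {
        JanusGraph source = JanusGraphFactory.open("conf/janusgraph-hbase.properties"); // existing HBase-backed graph
        JanusGraph target = JanusGraphFactory.open("conf/janusgraph-cql.properties");   // new Cassandra-backed graph
        GraphTraversalSource g = source.traversal();
        GraphTraversalSource h = target.traversal();

        // "segmentKey" is a placeholder property used only to split the work across workers.
        g.V().has("segmentKey", P.between(lowerBound, upperBound)).forEachRemaining(v -> {
            Vertex copy = h.addV(v.label()).next();
            for (String key : v.keys())
                copy.property(key, v.value(key));   // copy vertex properties; edges need a second pass
        });

        target.tx().commit();
        source.close();
        target.close();
    }
}
```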

How to handle backpressure on databases when using Apache Spark?

We are using Apache Spark for performing ETL for every 2 hours.
Sometimes Spark puts much pressure on databases when read/write operation is performed.
For Spark Streaming, I can see backpressure configuration on kafka.
Is there a way to handle this issue in batch processing?
Backpressure is actually just a fancy word for setting up the maximum receiving rate, so it doesn't work the way you think it does.
What should be done here is actually on the reading end.
Now in classical JDBC usage, JDBC connectors have a fetchSize property for PreparedStatements. So basically you can consider configuring that fetchSize with regard to what is said in the following answers:
Spark JDBC fetchsize option
What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
Unfortunately, this might not solve all of your performance issues with your RDBMS.
What you must know is that, compared to the basic JDBC reader which runs on a single worker, partitioning the data using an integer column or a sequence of predicates loads the data in a distributed mode but introduces a couple of problems. In your case, a high number of concurrent reads can easily throttle the database.
To deal with this, I suggest the following :
If available, consider using specialized data sources over JDBC connections.
Consider using specialized or generic bulk import/export tools like Postgres COPY or Apache Sqoop.
Be sure to understand the performance implications of the different JDBC data source variants, especially when working with a production database.
Consider using a separate replica for Spark jobs.
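As a rough illustration of the fetchSize and partitioning options discussed above, here is a minimal Java sketch of a partitioned JDBC read; the connection URL, table, bounds, and partition count are hypothetical and need to be tuned against your own database.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("jdbc-etl").getOrCreate();

        Dataset<Row> events = spark.read()
            .format("jdbc")
            .option("url", "jdbc:postgresql://db-host:5432/reports")   // hypothetical connection URL
            .option("dbtable", "public.events")                        // hypothetical table
            .option("user", "etl_user")
            .option("password", "***")
            .option("fetchsize", "1000")        // rows fetched per round trip, limits memory pressure
            .option("partitionColumn", "id")    // numeric column used to split the read
            .option("lowerBound", "1")
            .option("upperBound", "10000000")
            .option("numPartitions", "8")       // keep this low enough not to throttle the database
            .load();

        // Persist to a staging area so downstream steps don't re-read the database.
        events.write().mode("overwrite").parquet("hdfs:///staging/events");
        spark.stop();
    }
}
```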
If you wish to know more about reading data using the JDBC source, I suggest you read the following:
Spark SQL and Dataset API.
Disclaimer: I'm the co-author of that repo.

Is it right to access external cache in apache spark applications?

We have many micro-services (Java), and data is written to a Hazelcast cache for better performance. Now the same data needs to be made available to a Spark application for data analysis. I am not sure if accessing an external cache from Apache Spark is the right design approach. I cannot make database calls to get the data, as there would be many database hits which might affect the micro-services (currently we don't have HTTP caching).
I thought about pushing the latest data into Kafka and reading it in Spark. However, the data (each message) might be big (> 1 MB sometimes), which is not ideal.
If it is OK to use an external cache in Apache Spark, is it better to use the Hazelcast client or to read the Hazelcast cached data over a REST service?
Also, please let me know if there are any other recommended ways of sharing data between Apache Spark and micro-services.
Please let me know your thoughts. Thanks in advance.

Is Apache Ignite suitable for my use case (load Oracle tables to cache, do joins between these tables, and reflect changes back to Oracle)?

I would like to ask whether Ignite is suitable for my use case, which is:
Load all the data of the Oracle tables into the Ignite cache, and then run various SQL queries (aggregation/join/sub-query) against the data in the cache.
When Oracle has newly created data, or some data is updated, there should be some way for that data to be inserted into the cache or for the corresponding entries in the cache to be updated.
When the cache is down, there should be some way to restore the data from Oracle.
I am not sure whether the Ignite SQL Grid can fit this use case.
Also, I notice that IgniteRDD is not immutable; is IgniteRDD suitable for this use case? That is, I would first load the data in Oracle into an IgniteRDD, and then apply the corresponding changes to the IgniteRDD as data is created or updated in Oracle. But it looks like IgniteRDD doesn't support complicated SQL (aggregation/join/sub-query)?
This is one of the basic use cases supported by Ignite.
Data can be pre-loaded from Oracle using one of the methods covered in this documentation section.
If you're planning to update the data in Ignite first and propagate it to Oracle afterwards (which is the preferred way), then it makes sense to use Oracle as a CacheStore in write-through/read-through mode. Ignite will make sure to sync up the data with the persistence layer. Moreover, it will be straightforward to pre-load the data from Oracle if the cluster is restarted.
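As a minimal sketch of what such a configuration might look like, assuming a JDBC POJO store whose table-to-POJO mappings are configured elsewhere (for example, generated by the Web Console mentioned below), and a hypothetical "oracleDataSource" bean and Person POJO:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.store.jdbc.CacheJdbcPojoStoreFactory;
import org.apache.ignite.configuration.CacheConfiguration;

public class OracleBackedCache {
    // Placeholder POJO mapped to the Oracle table.
    public static class Person {
        public Long id;
        public String name;
    }

    public static void main(String[] args) {
        // Store factory pointing at Oracle; JdbcType mappings are assumed to be defined separately.
        CacheJdbcPojoStoreFactory<Long, Person> storeFactory = new CacheJdbcPojoStoreFactory<>();
        storeFactory.setDataSourceBean("oracleDataSource"); // hypothetical data source bean name

        CacheConfiguration<Long, Person> cfg = new CacheConfiguration<>("personCache");
        cfg.setCacheStoreFactory(storeFactory);
        cfg.setReadThrough(true);   // cache misses fall through to Oracle
        cfg.setWriteThrough(true);  // cache updates are persisted to Oracle

        Ignite ignite = Ignition.start();
        IgniteCache<Long, Person> cache = ignite.getOrCreateCache(cfg);

        // Warm up the cache from Oracle, e.g. after a cluster restart.
        cache.loadCache(null);
    }
}
```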
Finally, you can take advantage of the GridGain Web Console by connecting it to Oracle and mapping Oracle's schema to Ignite cache configurations and POJO objects.
As I mentioned, it's recommended to make all the updates through Ignite first, which will persist them to Oracle. But if Oracle is updated by other applications that are not aware of Ignite, you need to update the Ignite cluster on your own somehow. Ignite doesn't have any feature that covers this use case. However, this can be easily implemented with GridGain, which is built on top of Ignite, using its Oracle GoldenGate integration.
Once the data is in the Ignite cluster, use the SQL Grid to query and/or update your data. The SQL Grid engine is ANSI-99 compliant and doesn't have any limitations.
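For instance, an aggregation with a join over the caches could look roughly like this; the cache names, the Person/OrderRec types, and the fields are placeholders for your own model (Person is the POJO from the sketch above):

```java
import java.util.List;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.FieldsQueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class ReportQuery {
    // Join and aggregate across two caches; "orderCache" is referenced as a quoted schema name.
    static void printTotals(IgniteCache<?, ?> personCache) {
        SqlFieldsQuery qry = new SqlFieldsQuery(
            "SELECT p.name, SUM(o.amount) " +
            "FROM Person p JOIN \"orderCache\".OrderRec o ON o.personId = p.id " +
            "GROUP BY p.name");

        try (FieldsQueryCursor<List<?>> cursor = personCache.query(qry)) {
            for (List<?> row : cursor)
                System.out.println(row.get(0) + " -> " + row.get(1));
        }
    }
}
```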
As for the Ignite Shared RDD, it stores data in a distributed Ignite cache. This is why it is mutable, as opposed to Spark's native RDDs. The Shared RDD's SQL capabilities are exactly the same - it's just one more API on top of the SQL Grid.

Can we use Apache Spark to store Data? or is it only a Data processing tool?

I am new to Apache Spark. I would like to know whether it is possible to store data using Apache Spark, or is it only a processing tool?
Thanks for spending your time,
Satya
Spark is not a database, so it cannot "store data". It processes data and stores it temporarily in memory, but that's not persistent storage.
In real-life use cases you usually have a database or a data repository from where you access data in Spark.
Spark can access data that's in:
SQL Databases (Anything that can be connected using JDBC driver)
Local files
Cloud storage (eg. Amazon S3)
NoSQL databases.
Hadoop Distributed File System (HDFS)
and many more...
Detailed description can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
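To illustrate the read-process-persist pattern, here is a minimal Java sketch; the input/output paths and the CSV/Parquet formats are arbitrary placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ProcessAndPersist {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("process-and-persist").getOrCreate();

        // Read from external storage (placeholder path).
        Dataset<Row> input = spark.read().option("header", "true").csv("hdfs:///data/input.csv");

        // Process in memory.
        Dataset<Row> summary = input.groupBy("category").count();

        // Persist the result back to external storage; Spark itself keeps nothing after the job ends.
        summary.write().mode("overwrite").parquet("hdfs:///data/summary");

        spark.stop();
    }
}
```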
Apache Spark is primarily a processing engine. It works with underlying file systems such as HDFS, S3, and other supported file systems. It also has the capability to read data from relational databases. But primarily it is an in-memory distributed processing tool.
As you can read on Wikipedia, Apache Spark is defined as:
an open source cluster computing framework
When we talk about computing, it relates to a processing tool; in essence it allows you to work in a pipeline scheme (or somewhat like ETL): you read the dataset, you process the data, and then you store the processed data or the models that describe it.
If your main objective is to distribute your data, there are good alternatives like HDFS (Hadoop Distributed File System), among others.
