Which one is better to use for interacting with Cassandra from Java? - cassandra

I need to insert and pull data from an AWS Cassandra (C*) table.
The data producer is a Spring Boot application on Java 8.
Which one should I use for my project, in terms of stability and efficiency?
These are the options I have found so far:
1. Spring Data JPA.
2. cassandra-driver-core from DataStax.

Disclaimer: questions asking which tool/library is "better" are subjective, and aren't usually a good fit for Stack Overflow.
That being said, the Spring Data Cassandra driver inherently violates two Cassandra data access anti-patterns (that I know of):
Unbound SELECT COUNT(*) as a part of their paging mechanism.
Use of BATCH for multiple writes.
Additionally, the Spring Data Cassandra driver uses the DataStax driver underneath, which introduces an additional delay for bug fixes and upgrades.
tl;dr;
You cannot go wrong by using the DataStax Java driver, and I highly recommend its use.
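For completeness, here is a minimal sketch of connecting, inserting, and reading with the DataStax Java driver 4.x (artifact com.datastax.oss:java-driver-core); the contact point, datacenter, keyspace, and users table are placeholders for your own setup:

```java
import java.net.InetSocketAddress;
import java.util.UUID;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraQuickstart {
    public static void main(String[] args) {
        // Placeholder contact point, datacenter, and keyspace
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("10.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .withKeyspace("my_keyspace")
                .build()) {

            // Prepare once, bind per request - avoids re-parsing the CQL each time
            PreparedStatement insert = session.prepare(
                    "INSERT INTO users (id, name) VALUES (?, ?)");
            session.execute(insert.bind(UUID.randomUUID(), "alice"));

            // Read the data back
            ResultSet rs = session.execute("SELECT id, name FROM users");
            for (Row row : rs) {
                System.out.printf("%s -> %s%n", row.getUuid("id"), row.getString("name"));
            }
        }
    }
}
```

In a Spring Boot app you would typically expose the CqlSession as a singleton bean rather than building it per request.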

Related

Why is there no JDBC Spark Streaming receiver?

I suggest it's a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I have in mind no monitoring of new rows in the table, just reading the table once.
I was surprised that there is no JDBC Spark Streaming receiver implementation. Implementing a Receiver doesn't look difficult.
Could you describe why such a receiver doesn't exist (is this approach a bad idea?) or provide links to implementations?
I've found Stratio/datasource-receiver, but it reads all the data into a DataFrame before processing it with Spark Streaming.
Thanks!
First of all, an actual streaming source would require a reliable mechanism for monitoring updates, which is simply not part of the JDBC interface; nor is it a standardized (if present at all) feature of major RDBMSs, not to mention the other platforms that can be accessed through JDBC. It means that streaming from a source like this typically requires log replication or similar facilities, and is highly resource-dependent.
At the same time, what you describe:
"it's a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. [...] no monitoring of new rows in the table, just reading the table once"
is really not a use case for streaming. Streaming deals with infinite streams of data, while what you ask for is simply a scenario for partitioning, and such capabilities are already part of the standard JDBC connector (either by range or by predicate).
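For illustration, a minimal Java sketch of that built-in partitioning; the JDBC URL, credentials, bounds, and the numeric indexed id column are all assumptions:

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcPartitionedRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-partitioned-read")
                .getOrCreate();

        Properties props = new Properties();
        props.setProperty("user", "reader");
        props.setProperty("password", "secret");

        // Spark issues one bounded query per partition (WHERE id >= ... AND id < ...),
        // so no single executor has to hold the whole table in memory.
        Dataset<Row> table = spark.read().jdbc(
                "jdbc:postgresql://db-host:5432/mydb",
                "big_table",
                "id",          // partitionColumn: numeric and ideally indexed
                1L,            // lowerBound
                100_000_000L,  // upperBound
                64,            // numPartitions
                props);

        table.write().parquet("/data/big_table");
        spark.stop();
    }
}
```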
Additionally, receiver-based solutions simply don't scale well and effectively model a sequential process. As a result their applications are fairly limited, and they would be even less appealing if the data were bounded: if you're going to read finite data sequentially on a single node, there is no value in adding Spark to the equation.
I don't think it is a bad idea, since in some cases you have constraints that are outside your power, e.g. legacy systems to which you cannot apply strategies such as CDC, but which you still have to consume as a source of streaming data.
On the other hand, the Spark Structured Streaming engine, in micro-batch mode, requires the definition of an offset that can be advanced, as you can see in this class. So, if your table has some column that can be used as an offset, you can definitely stream from it, although RDBMSs are not that "streaming-friendly" as far as I know.
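To make the offset idea concrete, here is a rough plain-JDBC polling sketch of what such a micro-batch source does conceptually (the events table and its monotonically increasing id column are hypothetical; a real Spark source would wrap this logic in the Structured Streaming offset API):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OffsetPoller {
    public static void main(String[] args) throws Exception {
        long lastSeenId = 0L; // the "offset" that each micro-batch advances

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/mydb", "reader", "secret")) {
            while (true) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT 1000")) {
                    ps.setLong(1, lastSeenId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastSeenId = rs.getLong("id"); // advance the offset
                            process(rs.getString("payload"));
                        }
                    }
                }
                Thread.sleep(5_000); // poll interval between micro-batches
            }
        }
    }

    private static void process(String payload) {
        System.out.println(payload);
    }
}
```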
I have developed Jdbc2s, which is a DataSource V1 streaming source for Spark. It's also deployed to Maven Central if you need it; the coordinates are in the documentation.

Cassandra Loading Options

I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data to Cassandra.
My requirement is to read data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SSTables
3. COPY command
4. Java Batch
5. Dataflow (Google Cloud product)
Are there any other options, and which one is best?
Thanks,
For flat files, the two most effective options are:
Use Spark - it will load data in parallel, but requires some coding.
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON, and is very efficient. The DataStax Academy blog just started a series of posts on DSBulk, and the first post will give you enough information to get started. Also, if you have big files, consider splitting them into smaller ones, as this will allow DSBulk to perform a parallel load using all available threads.
For loading data from an RDBMS, it depends on what you want to do - load the data once, or keep it updated as it changes in the DB. For the first option you can use Spark with the JDBC source (it has some limitations too) and then save the data into DSE. For the second, you may need something like Debezium, which supports streaming change data from some databases into Kafka. From Kafka you can then use the DataStax Kafka Connector to submit the data into DSE.
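For the one-off option, a minimal Java sketch of the Spark JDBC-to-DSE pipeline (it assumes the spark-cassandra-connector is on the classpath; hosts, credentials, and the my_keyspace.customers table are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class RdbmsToCassandra {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdbms-to-cassandra")
                .config("spark.cassandra.connection.host", "10.0.0.1")
                .getOrCreate();

        // One-off load from the RDBMS ...
        Dataset<Row> source = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://db-host:5432/mydb")
                .option("dbtable", "customers")
                .option("user", "reader")
                .option("password", "secret")
                .load();

        // ... then write into Cassandra/DSE; column names must match the target table
        source.write()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "my_keyspace")
                .option("table", "customers")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}
```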
CQLSH's COPY command isn't as effective or flexible as DSBulk, so I wouldn't recommend using it.
And never use CQL BATCH for data loading unless you know how it works - it's very different from the RDBMS world, and used incorrectly it will actually make loading less efficient than executing separate statements asynchronously. (DSBulk uses batches under the hood, but that's a different story.)
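To make that last point concrete, here is a minimal sketch of issuing separate asynchronous inserts with the DataStax Java driver 4.x instead of one big CQL BATCH (the users table is a placeholder, and real code would also cap the number of in-flight requests):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class AsyncLoader {
    public static void load(CqlSession session, List<String> names) {
        PreparedStatement insert = session.prepare(
                "INSERT INTO users (id, name) VALUES (?, ?)");

        // Fire individual statements concurrently instead of one multi-partition BATCH:
        // each write goes straight to the replicas that own its partition,
        // rather than funnelling everything through a single coordinator.
        List<CompletableFuture<AsyncResultSet>> inFlight = new ArrayList<>();
        for (String name : names) {
            CompletionStage<AsyncResultSet> stage =
                    session.executeAsync(insert.bind(UUID.randomUUID(), name));
            inFlight.add(stage.toCompletableFuture());
        }

        // Wait for all writes to finish
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
}
```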

How can I Rely on Cassandra, since it doesn't meet my needs, By Itself?

Having read about Cassandra beforehand, we decided to choose it as our main database. The most important, useful, and special feature that encouraged us to choose this DB was scalability, which helps us handle large volumes of data.
But it seems that it cannot meet our requirements by itself. I asked some questions on Stack Overflow about our requirements and how we could address them using Cassandra, and the answers all came down to using additional tools on top of Cassandra, such as Spark, Solr, DSE Search, etc.
Our case really is Big Data, but we will also have a large variety of queries.
Given all this, is it wise to stay on Cassandra, or is it better to switch to another DB?
Cassandra is not adequate for ad-hoc queries, so I would recommend using Hive on Cassandra, mapping your Cassandra tables to Hive tables with the cassandra_handler_for_hive connector (and then using Hive to do joins and apply conditions on non-partition-key columns).
I should mention that the performance of queries using Hive with Cassandra is not great (I have had a case where a count(*) on a table with 500M records took 1 hour on 4 nodes). As a workaround I used to copy the tables to HDFS and then do the computation on the HDFS data, but this is not a good solution if you need fresh data.
Now for your question of whether to use Cassandra or not: it really depends on your needs. Cassandra has good read/write performance per record.
If your needs are met by using Hive on Cassandra for the queries you need, then why not stay on Cassandra?

Does Spark SQL include a table streaming optimization for joins?

Does Spark SQL include a table streaming optimization for joins and, if so, how does it decide which table to stream?
When doing joins, Hive assumes the last table is the largest one. As a join optimization, it will attempt to buffer the smaller join tables and stream the last one through. If the last table in the join list is not the largest one, Hive has the /*+ STREAMTABLE(tbl) */ hint which tells it the table that should be streamed. As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
This question has been asked for normal RDD processing, outside of Spark SQL, here. The answer there does not apply to Spark SQL, where the developer has no control over explicit cache operations.
I looked for an answer to this question some time ago, and all I could come up with was the spark.sql.autoBroadcastJoinThreshold parameter, which defaults to 10 MB. Spark then attempts to automatically broadcast every table whose size is smaller than the limit you set; join order plays no role for this setting.
If you are interested in further improving join performance, I highly recommend this presentation.
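For example, in Java (the 50 MB threshold, paths, and the dim_id join column are arbitrary placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.broadcast;

public class BroadcastJoinExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("broadcast-join")
                // raise the auto-broadcast limit from the 10 MB default to 50 MB
                .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
                .getOrCreate();

        Dataset<Row> facts = spark.read().parquet("/data/facts");
        Dataset<Row> dims = spark.read().parquet("/data/dims");

        // Or force a broadcast for a single join, regardless of estimated size
        Dataset<Row> joined = facts.join(broadcast(dims), "dim_id");
        joined.explain(); // should show BroadcastHashJoin in the physical plan
    }
}
```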
This answer is about the upcoming Spark 2.3 (RC2 is currently being voted on as the next release).
As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
Spark 2.3 (the latest, about to be released) does not support it either.
There is no support for the STREAMTABLE hint, but given the recent change to build a hint framework (SPARK-20857, a generic resolved hint node), one should be fairly easy to write.
You'd have to write some Spark optimizations, and possibly physical plan(s), to support STREAMTABLE (which seems like a lot of work), but it's possible. The tools are there.
Regarding join optimizations, in the upcoming Spark 2.3 there are two main logical optimizations:
ReorderJoin
CostBasedJoinReorder (exclusively for cost-based optimization)

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts on, and experiences with, the usage of CQL and the in-memory query engine Spark/Shark. From what I know, the CQL processor runs inside the Cassandra JVM on each node, while a Shark/Spark query processor attached to a Cassandra cluster runs outside it, in a separate cluster. Also, DataStax has a DSE version of Cassandra which allows deploying Hadoop/Hive. The question is: in which use cases would we pick one solution over the others?
I will share a few thoughts based on my experience. But, if possible, please tell us about your use case; it will help us answer your questions in a better manner.
1. If you are going to have more writes than reads, Cassandra is obviously a good choice. Having said that, if you are coming from a SQL background and planning to use Cassandra, you'll definitely find CQL very helpful. But if you need to perform operations like JOIN and GROUP BY, CQL is not the answer, even though it covers primitive GROUP BY use cases through write-time and compaction-time sorts and implements one-to-many relationships.
2. Spark SQL (formerly Shark) is very fast for two reasons: in-memory processing and planned data pipelines. In-memory processing makes it ~100x faster than Hive, and like Hive it handles larger-than-memory datasets well, up to 10x faster thanks to planned pipelines. The balance shifts in Spark SQL's favour when multiple data pipelines, like filter and groupBy, are present. Go for it when you need ad-hoc, real-time querying; it is not suitable for long-running jobs over gigantic amounts of data.
3. Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides an SQL-like interface to your data. But Hive is not suitable for real-time needs; it is best suited for offline batch processing. It doesn't need any additional infrastructure, as it uses the underlying HDFS for data storage. Go for it when you have to perform operations like JOIN and GROUP BY on large datasets, and for OLAP.
Note: Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features, but potentially faster. It supports the existing Hive query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
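As a small illustration of that, the kind of JOIN and GROUP BY that plain CQL cannot express can be run through Spark SQL with Hive support enabled (the employees/departments tables are made up for the example):

```java
import org.apache.spark.sql.SparkSession;

public class SparkSqlHive {
    public static void main(String[] args) {
        // enableHiveSupport() gives access to the Hive metastore, SerDes and UDFs
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-on-hive")
                .enableHiveSupport()
                .getOrCreate();

        // A JOIN + GROUP BY that CQL cannot express, executed by Spark
        spark.sql(
                "SELECT d.name, count(*) AS headcount " +
                "FROM employees e JOIN departments d ON e.dept_id = d.id " +
                "GROUP BY d.name")
             .show();

        spark.stop();
    }
}
```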
But I think you will be able to evaluate the pros and cons of all these tools properly only after getting your hands dirty, so I can only suggest based on your questions.
Hope this answers some of your queries.
P.S.: The above answer is based solely on my experience. Comments/corrections are welcome.
There is a very good benchmarking effort documented here - https://amplab.cs.berkeley.edu/benchmark/
