Cassandra Loading Options - apache-spark

I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data into it.
My requirement is to read data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see:
1. Use Spark and Kafka
2. SSTables
3. COPY command
4. Java batch
5. Dataflow (Google product)
Are there any other options, and which one is best?
Thanks,

For flat files, the two most effective options are:
Use Spark. It loads data in parallel, but requires some coding.
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON, and is very efficient. The DataStax Academy blog just started a series of posts on DSBulk, and the first post gives you enough information to get started. Also, if you have big files, consider splitting them into smaller ones, since that lets DSBulk load them in parallel using all available threads.
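To illustrate the file-splitting advice, here is a minimal sketch that cuts one large CSV into smaller chunk files and repeats the header row in each. The function name `split_csv`, the chunk naming scheme and the default of 100,000 rows per file are my own assumptions for illustration, not anything DSBulk prescribes.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, rows_per_file=100_000):
    """Split one large CSV into numbered chunk files, repeating the header in each.

    Hypothetical helper: names and the default chunk size are illustrative only.
    Returns the list of chunk paths that were written.
    """
    source, out_dir = Path(source), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with source.open(newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return written
        out, writer, count = None, None, 0
        for row in reader:
            # Start a new chunk on the first row and whenever the current one is full.
            if out is None or count >= rows_per_file:
                if out is not None:
                    out.close()
                path = out_dir / f"{source.stem}_{len(written):05d}.csv"
                out = path.open("w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                written.append(path)
                count = 0
            writer.writerow(row)
            count += 1
        if out is not None:
            out.close()
    return written
```

As far as I know, DSBulk's `-url` option can point at a directory of CSV files, so the output directory can be passed to `dsbulk load` as a whole. Check the documentation for your DSBulk version before relying on that.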
For loading data from an RDBMS, it depends on whether you want to load the data once or keep it updated as it changes in the database. For the first case, you can use Spark with the JDBC source (though it has some limitations) and then save the data into DSE. For the second, you may need something like Debezium, which streams change data from some databases into Kafka. From Kafka, you can then use the DataStax Kafka Connector to write the data into DSE.
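For the one-off load, a rough PySpark sketch could look like the following. It assumes a `SparkSession` that already has the JDBC driver for your database and the Spark Cassandra Connector on its classpath. The helper names (`jdbc_read_options`, `cassandra_write_options`, `copy_table`) are hypothetical. The option keys are the ones documented for Spark's JDBC source and for the connector's DataFrame sink.

```python
def jdbc_read_options(url, table, user, password,
                      partition_column, lower_bound, upper_bound, num_partitions):
    """Options for Spark's built-in JDBC source; the last four enable a parallel read."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def cassandra_write_options(keyspace, table):
    """Options for the Spark Cassandra Connector's DataFrame sink."""
    return {"keyspace": keyspace, "table": table}


def copy_table(spark, read_opts, write_opts):
    """One-off copy: read the RDBMS table over JDBC and append it to Cassandra/DSE.

    `spark` is assumed to be an existing SparkSession with the JDBC driver
    and the Spark Cassandra Connector available.
    """
    df = spark.read.format("jdbc").options(**read_opts).load()
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(**write_opts)
       .mode("append")
       .save())
```

The target table has to exist in Cassandra already, with column names matching the DataFrame's columns.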
CQLSH's COPY command isn't as efficient or flexible as DSBulk, so I don't recommend using it.
And never use CQL batches for data loading unless you know how they work. They are very different from batches in the RDBMS world, and if used incorrectly they make loading less efficient than executing separate statements asynchronously. (DSBulk uses batches under the hood, but that's a different story.)

Related

Import cassandra sstable db to hdfs via apache spark

We have use cases that involve importing daily snapshot files from Cassandra (db files) onto HDFS.
The file sizes can be anywhere from 200 GB to 1 TB, and ideally we'd like to do the processing on HDFS instead of on a single machine/server.
I have looked into sstabledump, which allows this use case but has the following issues:
It looks like it would run on a single machine, which might end up being very slow depending on the number and size of the db files.
The consistency of the data in the db files is a concern: Cassandra is set up as a 3-node cluster, and a single record may be persisted across several db files.
Ideally, we'd love to have a plugin/SDK that can read the db files directly from the Apache Spark DataFrame API, while taking care of data redundancy and integration.
Questions
Is such a connection via Apache Spark possible?
If not, is there a better way to approach this problem using sstabledump, e.g. a mode that gets rid of duplicate data across the various db files?
Are there any tools other than sstabledump that are better suited for this?

Insert multiple rows in cqlengine

I am stuck on inserting/updating multiple rows (approximately 800) into a Cassandra table with cqlengine. I do not want to use a loop in Python. I searched and found batch queries, but I cannot get them to work.
Please help me write a batch query, or suggest another efficient way to insert multiple rows into Cassandra.
Thank you.
https://cqlengine.readthedocs.io/en/latest/topics/queryset.html#batch-queries
CQL batches are not an optimisation -- they do not make your queries run faster. In fact, they do the opposite if you have large batches because they can overload the coordinator of the request and queries end up running slower.
CQL batches are designed to achieve atomicity so either (a) all the statements in the batch are executed successfully, or (b) none at all.
In Cassandra, you can achieve higher throughput if you issue multiple asynchronous writes instead of a single batch. Multiple app instances (clients) also perform better, because traffic can get bottlenecked on a single client app.
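As a sketch of the asynchronous approach, the function below issues one write per row through `session.execute_async` and waits on the returned futures in groups, so only a bounded number of requests are in flight at once. It assumes a `session` object from the DataStax Python driver (the core driver rather than cqlengine models). The name `write_async` and the default of 64 in-flight requests are my own choices for illustration.

```python
def write_async(session, statement, rows, max_in_flight=64):
    """Issue one write per row with execute_async, capping the requests in flight.

    `session` is assumed to expose execute_async(statement, params) returning a
    future with .result(), as the DataStax Python driver's Session does.
    Returns the number of completed writes.
    """
    pending = []
    done = 0
    for params in rows:
        pending.append(session.execute_async(statement, params))
        if len(pending) >= max_in_flight:
            # Drain the current group before sending more.
            for future in pending:
                future.result()  # blocks; raises if that write failed
            done += len(pending)
            pending = []
    for future in pending:
        future.result()
    return done + len(pending)
```

The driver also ships `cassandra.concurrent.execute_concurrent_with_args`, which does this kind of bounded-concurrency execution for you and is probably the better choice in real code.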
If your goal is to bulk load data, I recommend you instead use a tool like DataStax Bulk Loader (DSBulk). DSBulk is free, open-source software that lets you bulk load data in CSV or JSON format into a Cassandra cluster.
Here are some resources to help you get started:
Introducing DataStax Bulk Loader
Loading data to Cassandra with DSBulk
More loading examples with DSBulk
Common DSBulk settings
Unloading data from Cassandra with DSBulk
Counting data with DSBulk (handy for verifying records loaded)
Examples for loading from other locations

Can Apache Spark be used in place of Sqoop

I have tried connecting Spark over JDBC to fetch data from MySQL, Teradata or similar RDBMSs, and was able to analyse the data.
Can Spark be used to store the data to HDFS?
Is there any possibility of Spark outperforming Sqoop at these tasks?
Looking forward to your valuable answers and explanations.
The main difference between Sqoop and Spark is that Sqoop will read the data from your RDBMS no matter what you have, and you don't need to worry much about how your table is configured.
With Spark over a JDBC connection, how you load the data is a little different. If your table doesn't have a column such as a numeric ID or a timestamp, Spark will load ALL the data into a single partition and then try to process and save it. If you have a column to partition on, Spark can sometimes be even faster than Sqoop.
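To make the partitioning point concrete, here is a simplified approximation of how a partitioned JDBC read turns one integer column range into one query per partition. It is not Spark's actual code, and the function name `partition_predicates` is hypothetical. It only shows the idea behind the `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` options.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE clause per partition for an integer column range.

    Simplified approximation for illustration. [None] means a single,
    unpartitioned read of the whole table.
    """
    n = min(num_partitions, upper - lower)
    if n <= 1:
        return [None]
    stride = upper // n - lower // n
    predicates = []
    bound = lower
    for i in range(n):
        lo_clause = f"{column} >= {bound}" if i > 0 else None
        bound += stride
        hi_clause = f"{column} < {bound}" if i < n - 1 else None
        if lo_clause and hi_clause:
            predicates.append(f"{lo_clause} AND {hi_clause}")
        elif hi_clause:
            # First partition is open below and also picks up NULL keys.
            predicates.append(f"{hi_clause} OR {column} IS NULL")
        else:
            # Last partition is open above.
            predicates.append(lo_clause)
    return predicates
```

For example, `partition_predicates("id", 0, 100, 4)` yields four clauses with a stride of 25. The first and last partitions are open-ended, so the bounds only decide how rows are split across partitions; they do not filter rows out.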
I would recommend taking a look at this doc.
In conclusion, if you are doing a simple daily export with no transformation, I would recommend Sqoop: it is simple to use and will not impact your database that much. Spark will work well IF your table is ready for it; otherwise, go with Sqoop.

Any benefit for my case when using Hive as datawarehouse?

Currently, I am trying to adopt big data tools to replace my current data analysis platform. My current platform is pretty simple: my system gets a lot of structured CSV feed files from various upstream systems, and we load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS or the local filesystem, so Hive as a data warehouse does not seem to be a must. However, I could still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the benefits of introducing a Hive layer rather than loading the CSV files directly into a Spark DataFrame?
Thanks.
You can always inspect the data through the tables.
Ad hoc queries and aggregations can be performed using HiveQL.
When accessing that data through Spark, you need not specify the schema of the data separately.

Spark Storm or Flink - Big Data analysis

Can anyone recommend a technology to explore if I have a large data set in a Cassandra table (3-node cluster) and need to perform a sum operation on the records received each day? The value so calculated needs to be updated in a MySQL table.
Steps to perform:
1. Fetch IDs from the MySQL table
2. Run the sum operation on the Cassandra table
3. Insert/update the calculated sum in the MySQL table
Currently I am using plain Java to perform these tasks with SQL and CQL queries, but it's very slow, and in future the data will grow exponentially.
Can anyone suggest technologies that would accomplish this task in the fastest possible way with the lowest development time?
There's not much to recommend; it depends only on the task you have and your own preferences.
Apache Storm is a streaming engine. It would be a good fit if you wanted to process a stream of entries, not a batch of data as in your case.
Both Apache Spark and Apache Flink will let you run a batch job once a day or build a streaming application that calculates the results for one day.
I prefer Apache Spark, as it has a unified API for batch and streaming jobs (so you can easily change code from batch to streaming) and strong community support. Apache Flink supports real-time streaming, but that isn't necessary in your case.
However, you should look at these two frameworks yourself and choose the one that looks better to you. In my opinion, both of them will be OK.
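Whichever framework you pick, the job itself is a group-and-sum followed by an upsert. Here is a plain-Python sketch of that logic, with in-memory data standing in for the Cassandra rows. The table name `daily_totals`, its columns and the function names are placeholders I made up. In Spark or Flink the same step would be a group-by with a sum over the day's records.

```python
from collections import defaultdict

# Hypothetical MySQL upsert for step 3; table and column names are placeholders.
UPSERT_SQL = (
    "INSERT INTO daily_totals (id, total) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE total = VALUES(total)"
)


def sum_by_id(wanted_ids, records):
    """Step 2: sum the amounts per ID, keeping only the IDs fetched in step 1.

    `records` is an iterable of (id, amount) pairs standing in for the
    day's rows read from Cassandra.
    """
    wanted = set(wanted_ids)
    totals = defaultdict(int)
    for record_id, amount in records:
        if record_id in wanted:
            totals[record_id] += amount
    return dict(totals)


def upsert_params(totals):
    """Step 3: parameter tuples to pass to UPSERT_SQL, e.g. via executemany."""
    return sorted(totals.items())
```

With a MySQL client library, the tuples from `upsert_params` could be passed to `cursor.executemany(UPSERT_SQL, ...)`, so the whole day's results go to MySQL in one call instead of one query per ID.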

Resources