Import Cassandra SSTable db files to HDFS via Apache Spark - apache-spark

We have use cases that involve a daily import of the snapshot files from Cassandra (db files) onto HDFS.
The file sizes can be anywhere from 200 GB to 1 TB, and ideally we'd like to do the processing on HDFS instead of on a single machine/server.
I have looked into sstabledump, which allows this use case but has the following issues:
It looks like it would run on a single machine, which might end up being very slow depending on the number and size of the db files.
The consistency of the data in the db files is a concern: Cassandra is a 3-node cluster setup, and there is a chance of a single record persisting across several db files.
Ideally, we'd love to have a plugin/SDK that can read the db files directly from the Apache Spark DataFrame API, whilst taking care of data redundancy and integration.
Questions
Is such a connection via Apache Spark possible?
If not, is there a better way to approach this problem using sstabledump, e.g. a mode that gets rid of duplicate data across the various db files?
Are there any tools other than sstabledump that are better suited for this?

Related

How do I replicate a local Cassandra node to a remote Cassandra node?

I need to replicate a local node that uses SimpleStrategy to a remote node in another Cassandra database. Does anyone have any idea where to begin?
The main complexity here, if you're writing data into both clusters, is how to avoid overwriting data that was changed in the cloud later than in your local setup. There are several possibilities to do that:
If the structure of the tables is the same (including the names of the keyspaces if user-defined types are used), then you can just copy SSTables from your local machine to the cloud and use sstableloader to replay them. In this case, Cassandra will honor the actual writetime and won't overwrite changed data. Also, if you're deleting from the tables, you need to copy the SSTables before the tombstones expire. You don't have to copy all SSTables every time, just the files that have changed since the last data upload. But you always need to copy SSTables from all of the nodes you're uploading from.
If the structure isn't the same, then you can look at using either DSBulk or the Spark Cassandra Connector. In both cases you'll need to export the data together with its writetime, and then load it with that timestamp as well. Please note that in both cases, if different columns have different writetimes, you will need to load that data separately, because Cassandra allows only one timestamp to be specified when updating/inserting data.
In the case of DSBulk, you can follow example 19.4 from this blog post for exporting the data, and example 11.3 (from another blog post) for loading it. So this may require some shell scripting. Plus, you'll need disk space to keep the exported data (but you can use compression).
In the case of the Spark Cassandra Connector, you can export data without intermediate storage if both nodes are accessible from Spark. But you'll need to write some Spark code for reading the data using the RDD or DataFrame APIs.
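The point above about one timestamp per insert can be illustrated with a small sketch. It is plain Python that only builds CQL strings, not the answer's actual tooling: it groups the columns of one exported row by writetime and emits one INSERT ... USING TIMESTAMP per distinct writetime. The table ks.users, its columns, and the %s placeholder style are assumptions made for illustration.

```python
from collections import defaultdict


def inserts_by_writetime(table, primary_key, columns):
    """Build one INSERT per distinct writetime for a single row.

    primary_key: dict of key column name -> value
    columns: dict of regular column name -> (value, writetime in microseconds)
    Returns a list of (cql, params) tuples ordered by writetime.
    """
    # Collect the columns that share the same writetime.
    groups = defaultdict(dict)
    for name, (value, writetime) in columns.items():
        groups[int(writetime)][name] = value

    statements = []
    for writetime in sorted(groups):
        # Every statement repeats the primary key so it targets the same row.
        names = list(primary_key) + list(groups[writetime])
        params = list(primary_key.values()) + list(groups[writetime].values())
        placeholders = ", ".join(["%s"] * len(names))
        cql = (
            f"INSERT INTO {table} ({', '.join(names)}) "
            f"VALUES ({placeholders}) USING TIMESTAMP {writetime}"
        )
        statements.append((cql, params))
    return statements


# Hypothetical exported row: two columns share a writetime, one was updated later.
row_key = {"id": 42}
row_columns = {
    "name": ("alice", 1600000000000000),
    "email": ("alice@example.com", 1600000000000000),
    "city": ("Berlin", 1650000000000000),
}
statements = inserts_by_writetime("ks.users", row_key, row_columns)
for cql, params in statements:
    print(cql, params)
```

Each (cql, params) pair would then be handed to whatever client performs the load. Because every statement carries the original writetime, replaying them should not overwrite values that were written later on the target cluster.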

Parquet VS Database

I am trying to understand which of the two options below would be better, especially in a Spark environment:
Loading the Parquet file directly into a DataFrame and accessing the data (a 1 TB data table)
Using a database to store and access the data.
I am working on a data pipeline design and trying to understand which of the two options above will result in a more optimized solution.
Loading the Parquet file directly into a DataFrame and accessing the data is more scalable compared to reading from an RDBMS like Oracle through a JDBC connector. I handle more than 10 TB of data, but I prefer the ORC format for better performance. I suggest reading the data directly from files. The reason is data locality: if you run your Spark executors on the same hosts where the HDFS data nodes are located, they can effectively read data into memory without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.

Is there any benefit in my case to using Hive as a data warehouse?

Currently, I am trying to adopt big data to replace my current data analysis platform. My current platform is pretty simple: my system gets a lot of structured CSV feed files from various upstream systems, and we then load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS / the filesystem, so Hive as a data warehouse does not seem to be a must. However, I can still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the pros / benefits of introducing a Hive layer rather than loading the CSV files directly into a Spark DF?
Thanks.
You can always inspect the data through the tables.
Ad hoc queries/aggregations can be performed using HiveQL.
When accessing that data through Spark, you need not specify the schema of the data separately.

Cassandra Loading Options

I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data to Cassandra.
My requirement is to read data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SSTables
3. COPY command
4. Java Batch
5. Dataflow (Google product)
Are there any other options, and which one is best?
Thanks,
For flat files, the 2 most effective options are:
Use Spark - it will load data in parallel, but requires some coding.
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON, and is very effective. The DataStax Academy blog just started a series of blog posts on DSBulk, and the first post will give you enough information to start with it. Also, if you have big files, consider splitting them into smaller ones, as that will allow DSBulk to perform a parallel load using all available threads.
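The suggestion to split big files into smaller ones could be done along these lines. This is a minimal sketch using only the Python standard library, assuming the input is a CSV file with a header row. The function name split_csv, the part-NNNNN.csv naming, and the rows_per_file parameter are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def split_csv(source, out_dir, rows_per_file):
    """Split a CSV file into numbered parts, repeating the header in each part."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(source, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return parts
        out, writer, count = None, None, 0
        for row in reader:
            # Start a new part for the first row and whenever the current part is full.
            if out is None or count == rows_per_file:
                if out is not None:
                    out.close()
                part = out_dir / f"part-{len(parts):05d}.csv"
                parts.append(part)
                out = open(part, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                count = 0
            writer.writerow(row)
            count += 1
        if out is not None:
            out.close()
    return parts


def count_data_rows(path):
    """Count the rows of a CSV file, excluding the header."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1


# Small demonstration on a throwaway file with 10 data rows.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "big.csv"
    with open(source, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id", "value"])
        w.writerows([i, f"v{i}"] for i in range(10))
    parts = split_csv(source, Path(tmp) / "parts", rows_per_file=4)
    part_names = [p.name for p in parts]
    part_sizes = [count_data_rows(p) for p in parts]
    print(part_names, part_sizes)
```

Repeating the header in every part keeps each file loadable on its own. The resulting directory of parts is what would then be handed to the loader.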
For loading data from an RDBMS, it depends on what you want to do: load the data once, or update it as it changes in the DB. For the first option you can use Spark with a JDBC source (but it has some limitations too), and then save the data into DSE. For the second, you may need to use something like Debezium, which supports streaming change data from some databases into Kafka. Then, from Kafka, you can use the DataStax Kafka Connector to submit the data into DSE.
CQLSH's COPY command isn't as effective/flexible as DSBulk, so I wouldn't recommend using it.
And never use CQL batches for data loading unless you know how they work. They're very different from the RDBMS world, and if used incorrectly they will actually make loading less effective than executing separate statements asynchronously. (DSBulk uses batches under the hood, but that's a different story.)
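To make "executing separate statements asynchronously" concrete, here is a rough sketch of the pattern using Python's asyncio. It does not talk to Cassandra: write_row is a hypothetical stand-in for one asynchronous INSERT, and max_in_flight is an illustrative cap on the number of pending requests.

```python
import asyncio


async def write_row(row, store):
    """Hypothetical stand-in for one asynchronous INSERT; not a real driver call."""
    await asyncio.sleep(0)  # yield control, as a network round trip would
    store.append(row)


async def load_rows(rows, store, max_in_flight=4):
    """Issue one write per row, keeping at most max_in_flight writes pending."""
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def guarded(row):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await write_row(row, store)
            finally:
                in_flight -= 1

    # Each row becomes its own independent statement; none are grouped into a batch.
    await asyncio.gather(*(guarded(row) for row in rows))
    return peak


store = []
peak = asyncio.run(load_rows(range(20), store, max_in_flight=4))
print(len(store), peak)
```

The cap matters because unbounded concurrency can overload the cluster or the client. For a very large input, the rows would also be fed in chunks rather than gathered all at once.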

Spark connector loading vs sstableloader performance

I have a Spark job that right now pulls data from HDFS and transforms the data into flat files to load into Cassandra.
The Cassandra table is essentially 3 columns, but the last two are map collections, so it's a "complex" data structure.
Right now I use the COPY command and get about 3k rows/sec load, but that's extremely slow given that I need to load about 50 million records.
I see I can convert the CSV file to SSTables, but I don't see an example involving map collections and/or lists.
Can I use the Spark connector for Cassandra to load data with map collections and lists and get better performance than with just the COPY command?
Yes, the Spark Cassandra Connector can be much, much faster for files already in HDFS. Using Spark, you'll be able to read the data and write it into C* in a distributed fashion.
Even without Spark, using a Java-based loader like https://github.com/brianmhess/cassandra-loader will give you a significant speed improvement.

Resources