How do I replicate a Cassandra's local node for other Cassandra's remote node? - cassandra

I need to replicate a local node with a SimpleStrategy to a remote node in other Cassandra's DB. Does anyone have any idea where I begin?

The main complexity here, if you're writing data into both clusters is how to avoid overwriting the data that has changed in the cloud later than your local setup. There are several possibilities to do that:
If structure of the tables is the same (including the names of the keyspaces if user-defined types are used), then you can just copy SSTables from your local machine to the cloud, and use sstableloader to replay them - in this case, Cassandra will obey the actual writetime, and won't overwrite changed data. Also, if you're doing deletes from tables, then you need to copy SSTables before tombstones are expired. You may not copy all SSTables every time, just the files that has changed since last data upload. But you always need to copy SSTables from all nodes from which you're doing upload.
If structure isn't the same, then you can either look to using DSBulk or Spark Cassandra Connector. In both cases you'll need to export data with writetime as well, and then load it also with timestamp. Please note that in both cases if different columns have different writetime, then you will need to load that data separately because Cassandra allows to specify only one timestamp when updating/inserting data.
In case of DSBulk you can follow the example 19.4 for exporting of data from this blog post, and example 11.3 for loading (from another blog post). So this may require some shell scripting. Plus you'll need to have disk space to keep exported data (but you can use compression).
In case of Spark Cassandra Connector you can export data without intermediate storage if both nodes are accessible from Spark. But you'll need to write some Spark code for reading data using RDD or DataFrame APIs.

Related

Import cassandra ssstable db to hdfs via apache spark

We have use cases that involve importing daily the snapshot files from cassandra (db files) on to HDFS.
The file sizes can be anywhere from 200GB-1 TB and ideally we'd like to do the processing on hdfs instead of a single machine/server.
I have looked into sstabledump that allows this use case, but has the following issues:
It looks like it would run on a single machine which might end up being very slow depending on the number and size of db files
The consistency of data in the db file would be of concern- the cassandra is a 3 node cluster setup, and there are chances of a single record persisting across various db files
Ideally, we'd love to have a plugin/sdk that is able to read the db files directly from apache spark df api, whilst taking care of data redundancy and integration
Questions
Is such a connection via apache spark possible?
If not,is there a better way to approach is problem using sstableudmp - for eg- a mode that gets ride of duplicate data across various db files?
Are there any tools except sstabledump that are better suited for this?

Migrate data from one cassandra cluster to another

Hi I want to migrate data from my cassandra cluster to another cassandra cluster. I have seen many posts suggesting various methods but are not very clear or have limitations. The methods seen are as follows:
Using COPY TO and COPY FROM command: The is easy to use but seems to have a limitation on the number of rows it can copy.
Using SSTABLELOADER: Most articles suggests using sstableloader to move data from one cluster to another. But did not get clear details on steps to create sstables (is it possible to use some nodetool command or require java application to be created? Are these created per node or per cluster? If created how to move them from one cluster to another?) or creating snapshots which is tedious in way that they are created per node and have to be transferred to another cluster. Have also seen answers suggesting using parallel ssh to create snapshot for whole cluster but did not get any example for this as well.
Any help would be appreciated.
It's really a question that requires more information to provide definitive answer. For example, do you need to keep the metadata, such as, WriteTime and TTLs on data, or not? Does the destination cluster has the same topology (number of nodes, token allocation, etc.).
Basically, you have following options:
Use sstableloader - tool shipped with Cassandra itself that is used for restoring from backups, etc. To perform data migration you need to create a snapshot of the table to load (using nodetool snapshot), and run sstableloader on that snapshot. Main advantage is that it will keep metadata (TTL/WriteTime). Main disadvantage is that you need to perform taking snapshot/loading on all nodes of the source cluster, and you need to have exactly the same schema and partitioner in the destination cluster;
You can use backup/restore tool, such as, medusa, that basically automating the taking of snapshot & loading the data;
You can use Apache Spark to copy data from one table to another using Spark Cassandra Connector, for example, as described in this blog post - just read table for one cluster, and write to a table in another cluster. Works fine for simple copy operations, and you have a possibility to perform transformation of data if necessary, but becomes more complex if you need to preserve metadata. Plus it needs Spark;
Use DataStax Bulk Loader (DSBulk) to export data to files on disk, and load into another cluster. In contrast to cqlsh's COPY command, it's heavily optimized for loading/unloading of big amounts of data. It works with Cassandra 2.1+ and most DSE versions (except ancient ones).
If you are able to set up the target cluster with exactly the same topology as the source cluster, the fastest way may be to simply copy the data files from the source to the target cluster, since this avoids the overhead of processing the data to redistribute it to different nodes. In order for this to work, your target cluster must have the same number of nodes, the same rack configuration, and even the same tokens assigned to each node.
To get the tokens for a source node, you can run nodetool info -T | grep Token | awk '{print $3}' | tr '\n' , | sed 's/,$/\n/'. You can then copy the comma-separated list of tokens from the output and paste it into the initial_token setting in your target node's cassandra.yaml. Once you start the node, check its tokens using nodetool info -T to verify that it has the correct tokens. Repeat these steps for each node in the target cluster.
Once you have all of your target nodes set up with exactly the same tokens, DC, and racks as the source cluster, take a snapshot of the desired tables on the source cluster and copy the snapshots to the corresponding node's data directories on the target cluster. DataStax OpsCenter can automate the process of backing up and restoring data and will use direct copying for clusters with the same topology. It appears that medusa can do this too though I have not used this tool before.

Cassandra Loading Options

I have deployed a 9 node DataStax Cluster in Google Cloud. I am new to Cassandra and not sure how generally people push the data to Cassandra.
My requirement is read the data from flatfiles and RDBMs table and load into Cassandra which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SStables
3. Copy Command
4. Java Batch
5. Data Flow ( Google product )
Is there any other options and which one is best.
Thanks,
For flat files you have 2 most effective options:
Use Spark - it will load data in parallel, but requires some coding.
Use DSBulk for batch loading of data from command line. It supports loading from CSV and JSON, and very effective. DataStax's Academy blog just started a series of the blog posts on DSBulk, and first post will provide you enough information to start with it. Also, if you have big files, consider to split them into smaller ones, as it will allow DSBulk to perform parallel load using all available threads.
For loading data from RDBMS, it depends on what you want to do - load data once, or need to update data as they change in the DB. For first option you can use Spark with JDBC source (but it has some limitations too), and then saving data into DSE. For 2nd, you may need to use something like Debezium, that supports streaming of change data from some databases into Kafka. And then from Kafka you can use DataStax Kafka Connector for submitting data into DSE.
CQLSH's COPY command isn't as effective/flexible as DSBulk, so I won't recommend to use it.
And never use CQL Batch for data loading, until you know how it works - it's very different from RDBMS world, and if it's used incorrectly it will really make loading less effective than executing separate statements asynchronously. (DSBulk uses batches under the hood, but it's different story).

spark connector loading vs sstableloader performance

I have a spark job that right now pulls data from HDFS and transforms the data into flat files to load into the Cassandra.
The cassandra table is essentially 3 columns but the last two are map collections, so a "complex" data structure.
Right now I use the COPY command and get about 3k rows/sec load but thats extremely slow given that I need to load about 50milllion records.
I see I can convert the CSV file to sstables but I don't see an example involving map collections and/or lists.
Can I use the spark connector to cassandra to load data with map collections and lists and get better performance than just the COPY command?
Yes the Spark Cassandra Connector can be much much faster for files already in HDFS. Using spark you'll be able to distributedly grab and write into C*.
Even without Spark using a java based loader like https://github.com/brianmhess/cassandra-loader will give you a significant speed improvement.

Changing Cassandra partitioner type

Question about changing partitioner types: i want to use sstableloader to copy data from an old cluster to a newer cluster. but the old cluster is using RandomPartitioner, whereas the new one uses Murmur3Partitioner. You might ask why not using COPY command to export data to csv and import it again? Well we have huge data sets and COPY command would not work (all other node's data would aggregate to one machine).
is it possible to switch new cluster partitioner to RandomPartitioner, do data replication using sstableloader, and switch back? (i tried switching, but cassandra won't restart because of it...)
No, you can't change the partitioner. That would require Cassandra to redistribute all the data and is not supported.
You could use sstable2json (with the old yaml) and then json2sstable (with the new yaml) to convert your SSTables manually. Then you can use sstableloader.

Resources