Cassandra: Migrate keyspace data from Multinode cluster to SingleNode Cluster


I have a keyspace in a multi-node cluster in QA environment. I want to copy that keyspace to my local single-node cluster. Is there any direct way to do this? I can't afford to write some code like SSTableLoader implementation at this point of time. Please suggest the quickest way.

Make sure you have plenty of free disk space on your new node and that you've properly set the replication factor and consistency levels in your tests/build for your new, single-node "cluster".
First, restore the exact schema from the old cluster to your new node. After that the data can be loaded in two ways:
1.) Execute the "sstableloader" utility on every node in your old cluster and point it at your new node. sstableloader is token aware, but in your case it will end up shipping all data to your new, single node cluster.
sstableloader -d NewNode /Path/To/OldCluster/SStables
2.) Snapshot the keyspace and copy the raw sstable files from the snapshot folders of each table in your old cluster to your new node. Once they're all there, copy the files to their corresponding table directory and run "nodetool refresh."
# Rinse and repeat for all tables
nodetool snapshot -t MySnapshot
cd /Data/keyspace/table-UUID/snapshots/MySnapshot/
rsync -avP ./*.db User@NewNode:/NewData/Keyspace/table-UUID
...
# when finished, exec the following for all tables in your new node
nodetool refresh keyspace table
Option #1 is probably best because it will stream the data and compact naturally on the new node. It's also less manual work. Option #2 is good, quick, and dirty if you don't have a direct line from one cluster to the other. You probably won't notice much difference since it's probably a relatively small keyspace for QA.
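Option #1 can be sketched as a small loop run on each node of the old cluster. This is only a sketch under assumptions: the function name print_loader_cmds, the host name, and the directory layout <data_dir>/<keyspace>/<table>-<id> are illustrative and may not match your install. It prints the sstableloader commands instead of running them, so you can review them first.

```shell
#!/bin/sh
# Print one sstableloader command per table directory of a keyspace.
# Assumed layout: <data_dir>/<keyspace>/<table>-<id>/ (adjust as needed).
print_loader_cmds() {
    new_node=$1      # address of the new single node (hypothetical name)
    keyspace_dir=$2  # directory holding one sub-directory per table
    for table_dir in "$keyspace_dir"/*/; do
        # When the glob matches nothing it stays literal, so skip non-directories.
        [ -d "$table_dir" ] || continue
        printf 'sstableloader -d %s %s\n' "$new_node" "${table_dir%/}"
    done
}
```

For example, print_loader_cmds NewNode /Path/To/OldCluster/Keyspace lists the commands for that node; once they look right, pipe the output to sh, then repeat on the other old nodes.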

Related

Data Inconsistency in Cassandra Cluster after migration of data to a new cluster

I see some data inconsistency after moving data to a new cluster.
Old cluster has 9 nodes in total and each has got 2+ TB of data on it.
New cluster has same set of nodes as old and configuration is same.
Here is what I've performed in order:
nodetool snapshot.
Copied that snapshot to destination
Created a new Keyspace on Destination Cluster.
Used sstableloader utility to load.
Restarted all nodes.
After the transfer completed successfully, I ran a few queries to compare the old and new clusters and found that the new cluster is not consistent, although the data appears properly distributed across the nodes (nodetool status).
The same query returns different result sets for some partitions: I get zero rows the first time, then 100 rows, then 200 rows, and eventually it becomes consistent for a few partitions and the record count matches the old cluster.
A few partitions have no data in the new cluster, whereas the old cluster has data for those partitions.
I tried running queries in cqlsh with CONSISTENCY ALL, but the problem still exists.
Did I miss any important steps before or after the load?
Is there any procedure to find out the root cause of this?
I am currently running "nodetool repair", but I doubt that it will solve the problem, since I already tried CONSISTENCY ALL.
Highly Appreciate your help!

The fact that the results eventually become consistent indicates that the replicas are out of sync.
You can verify this by reviewing the logs around the time that you were loading data, particularly for dropped mutations. You can also check the output of nodetool netstats. If you're seeing blocking read repairs, that's another confirmation that the replicas are out-of-sync.
If you still have other partitions you can test, enable TRACING ON in cqlsh when you query with CONSISTENCY ALL. You will see if there are digest mismatches in the trace output which should also trigger read repairs. Cheers!
[EDIT] Based on your comments below, it sounds like you possibly did not load the snapshots from ALL the nodes in the source cluster with sstableloader. If you've missed loading SSTables to the target cluster, then that would explain why data is missing.
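As a rough way to do the log review suggested above, the following sketch counts log lines that mention dropped mutations. The function name count_dropped_mutations is made up for illustration, and the pattern assumes the log line contains the words "MUTATION" and "dropped" in that order; the exact wording may differ between Cassandra versions, so treat a zero count with caution.

```shell
#!/bin/sh
# Count lines in a Cassandra log file that mention dropped mutations.
# Assumption: such lines contain "MUTATION ... dropped" (case-insensitive).
count_dropped_mutations() {
    log_file=$1
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function safe under "set -e".
    grep -ci 'mutation.*dropped' "$log_file" || true
}
```

Run it against the system log of each node in the new cluster for the period when sstableloader was running. nodetool tpstats also reports dropped message counts per node, which can serve as a cross-check.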

Migrate data from one cassandra cluster to another

Hi, I want to migrate data from my Cassandra cluster to another Cassandra cluster. I have seen many posts suggesting various methods, but they are not very clear or have limitations. The methods I have seen are as follows:
Using the COPY TO and COPY FROM commands: this is easy to use but seems to have a limitation on the number of rows it can copy.
Using sstableloader: most articles suggest using sstableloader to move data from one cluster to another, but I did not find clear details on the steps to create the SSTables (is it possible to use some nodetool command, or does a Java application have to be written? Are they created per node or per cluster? Once created, how are they moved from one cluster to another?) or on creating snapshots, which is tedious in that they are created per node and have to be transferred to the other cluster. I have also seen answers suggesting parallel ssh to create a snapshot for the whole cluster, but did not find any example of this either.
Any help would be appreciated.

It's really a question that requires more information to give a definitive answer. For example, do you need to keep metadata such as WriteTime and TTLs on the data or not? Does the destination cluster have the same topology (number of nodes, token allocation, etc.)?
Basically, you have following options:
Use sstableloader, a tool shipped with Cassandra itself that is used for restoring from backups, etc. To perform the data migration you need to create a snapshot of the table to load (using nodetool snapshot) and run sstableloader on that snapshot. The main advantage is that it keeps metadata (TTL/WriteTime). The main disadvantage is that you need to take the snapshot and run the load on all nodes of the source cluster, and you need to have exactly the same schema and partitioner in the destination cluster;
You can use a backup/restore tool, such as medusa, that basically automates taking the snapshot and loading the data;
You can use Apache Spark to copy data from one table to another using Spark Cassandra Connector, for example, as described in this blog post - just read table for one cluster, and write to a table in another cluster. Works fine for simple copy operations, and you have a possibility to perform transformation of data if necessary, but becomes more complex if you need to preserve metadata. Plus it needs Spark;
Use DataStax Bulk Loader (DSBulk) to export data to files on disk, and load into another cluster. In contrast to cqlsh's COPY command, it's heavily optimized for loading/unloading of big amounts of data. It works with Cassandra 2.1+ and most DSE versions (except ancient ones).
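For the DSBulk option, a minimal sketch of the export/import pair might look like the following. The function name print_dsbulk_copy, the host names, and the export directory are placeholders; the script only prints the two commands so they can be checked before being run. It assumes the table already exists with the same schema on the destination, and a plain unload/load like this is not expected to carry WriteTime or TTL values.

```shell
#!/bin/sh
# Print a dsbulk unload/load pair that copies one table between clusters
# through files on local disk. All argument values are placeholders.
print_dsbulk_copy() {
    src_host=$1; dst_host=$2; keyspace=$3; table=$4; export_dir=$5
    # Export the table from the source cluster to files in export_dir.
    printf 'dsbulk unload -h %s -k %s -t %s -url %s\n' \
        "$src_host" "$keyspace" "$table" "$export_dir"
    # Load the same files into the identically named table on the destination.
    printf 'dsbulk load -h %s -k %s -t %s -url %s\n' \
        "$dst_host" "$keyspace" "$table" "$export_dir"
}
```

For example, print_dsbulk_copy old-node new-node my_ks my_table /tmp/export prints the two commands for one table; repeat per table.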

If you are able to set up the target cluster with exactly the same topology as the source cluster, the fastest way may be to simply copy the data files from the source to the target cluster, since this avoids the overhead of processing the data to redistribute it to different nodes. In order for this to work, your target cluster must have the same number of nodes, the same rack configuration, and even the same tokens assigned to each node.
To get the tokens for a source node, you can run nodetool info -T | grep Token | awk '{print $3}' | tr '\n' , | sed 's/,$/\n/'. You can then copy the comma-separated list of tokens from the output and paste it into the initial_token setting in your target node's cassandra.yaml. Once you start the node, check its tokens using nodetool info -T to verify that it has the correct tokens. Repeat these steps for each node in the target cluster.
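The token pipeline above can also be wrapped in a small function. This sketch does the same job with a single awk call and avoids the sed '\n' replacement, which is not portable to every sed. The function name tokens_csv is made up, and it assumes, as the original pipeline does, that each token line of nodetool info -T looks like "Token : <value>" with the value in the third field.

```shell
#!/bin/sh
# Read "nodetool info -T" output on stdin and print the tokens as one
# comma-separated line, ready to paste into initial_token.
tokens_csv() {
    # sep starts out empty, so no comma is printed before the first token.
    awk '$1 == "Token" { printf "%s%s", sep, $3; sep = "," } END { printf "\n" }'
}
```

Usage on a source node would be: nodetool info -T | tokens_csv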
Once you have all of your target nodes set up with exactly the same tokens, DC, and racks as the source cluster, take a snapshot of the desired tables on the source cluster and copy the snapshots to the corresponding node's data directories on the target cluster. DataStax OpsCenter can automate the process of backing up and restoring data and will use direct copying for clusters with the same topology. It appears that medusa can do this too though I have not used this tool before.

What is the best way to change the partitioner in cassandra

Currently we are using the random partitioner and we want to change to the Murmur3 partitioner. I know we can achieve this by using sstable2json and then json2sstable to convert the SSTables manually and then loading them with sstableloader, or by creating a new cluster with Murmur3 and writing an application to pull all the data from the old cluster and write it to the new cluster.
Is there any other, easier way to achieve this?

There is no easy way. It's a pretty massive change, so you might want to check whether it's absolutely necessary (do some benchmarks; the difference is likely undetectable). It's more the kind of change to make if you're switching to a new cluster anyway.
To do it live: create a new cluster that uses Murmur3 and write to both clusters. In the background, read and copy data to the new cluster while the writes are duplicated. Once the background job is complete, flip reads from the old cluster to the new cluster, and then you can decommission the old cluster.
Offline: sstable2json -> json2sstable is a pretty inefficient mechanism. It will be a lot faster if you use an SSTable reader and an SSTable writer (i.e. edit SSTableExport in the Cassandra code to write a new SSTable instead of dumping output). If you have a smaller dataset, the cqlsh COPY command may be viable.
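For the small-dataset case, the cqlsh COPY route could be sketched like this. Because COPY goes through CQL rather than SSTable files, the data is re-partitioned by whichever partitioner the new cluster uses. The function name print_copy_cmds, the hosts, and the CSV path are placeholders, and the table is assumed to already exist on the new cluster; the script only prints the commands.

```shell
#!/bin/sh
# Print a cqlsh COPY TO / COPY FROM pair for one table.
# Exports from the old cluster to a CSV file, then imports it into the new one.
print_copy_cmds() {
    old_host=$1; new_host=$2; keyspace=$3; table=$4; csv_file=$5
    printf "cqlsh %s -e \"COPY %s.%s TO '%s' WITH HEADER = TRUE\"\n" \
        "$old_host" "$keyspace" "$table" "$csv_file"
    printf "cqlsh %s -e \"COPY %s.%s FROM '%s' WITH HEADER = TRUE\"\n" \
        "$new_host" "$keyspace" "$table" "$csv_file"
}
```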

Restore Cassandra snapshot to new keyspace in same cluster

I've found documentation on restoring a keyspace snapshot to the same keyspace and also restoring it to a new cluster. However, I'm trying to make a copy of a keyspace in Cassandra and cannot find how to restore a snapshot to a new keyspace. Does anyone know if this is possible or have other recommendations on how to make a copy of the keyspace?

Step 1:
In your new keyspace, redefine the column families the same way as they were defined in the old keyspace. You can get the list of commands by running this cql:
DESCRIBE KEYSPACE ;
Note that here, your keyspace replication factor, etc shall remain the same.
Step 2 (do this on each node):
Under the old keyspace folder inside Cassandra data directory, there should be a snapshot folder per ColumnFamily. Copy the SSTables directly from the snapshot folders to the relevant ColumnFamily folders of the new keyspace inside Cassandra directory.
Step 3:
Do a rolling restart, and run repair on each node.
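Step 2 can be sketched as a script that pairs each snapshot folder in the old keyspace with the matching table folder in the new keyspace. Everything here is an assumption to adapt: the function name print_snapshot_copy_cmds, the snapshot tag, and a layout of <keyspace_dir>/<table>-<id>/snapshots/<tag>. It prints cp commands rather than running them. On versions where the SSTable file names embed the keyspace name, the files would also need renaming, which this sketch does not handle.

```shell
#!/bin/sh
# Print cp commands that copy snapshot SSTables of every table in the old
# keyspace into the same-named table directory of the new keyspace.
print_snapshot_copy_cmds() {
    old_ks_dir=$1   # data directory of the old keyspace
    new_ks_dir=$2   # data directory of the new keyspace (tables already created)
    tag=$3          # snapshot name used with "nodetool snapshot -t"
    for snap_dir in "$old_ks_dir"/*/snapshots/"$tag"; do
        [ -d "$snap_dir" ] || continue
        table_dir=${snap_dir%/snapshots/*}   # <old_ks_dir>/<table>-<id>
        table=${table_dir##*/}               # <table>-<id>
        table=${table%-*}                    # <table>
        # The new keyspace has its own <id> suffix, so match on the table name.
        for target in "$new_ks_dir/$table"-*/; do
            [ -d "$target" ] || continue
            printf 'cp %s/* %s\n' "$snap_dir" "${target%/}"
        done
    done
}
```

Run it on each node, review the output, then execute it. Running nodetool refresh for each table of the new keyspace, as in the first answer on this page, may make the rolling restart unnecessary.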

Cassandra keyspace disappeared leading to data loss

I was adding a node (cassandra-03) to my Cassandra 2.1.8 cluster (2 existing nodes, cassandra-01 and cassandra-02, 160+GB each, 1 keyspace), following http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_node_to_cluster_t.html.
At stage #3 (after restarting each node), I realized that on my existing nodes (cassandra-01 and cassandra-02) my keyspace had disappeared, but the data is still on the filesystem.
nodetool status gives the expected output (3 nodes cluster), except on the data column (I was expecting 160GB on cassandra-01 and cassandra-02), where I only have a few KB.
I moved forward on step #4 and ran nodetool cleanup on cassandra-01. It worked in a few seconds, but my keyspace is still missing.
I re-created my keyspace via cqlsh, hoping cassandra will use the data sitting on the filesystem, with no luck.
Nothing weird on the logs, as far as I can tell.
How could I get my keyspace data back?

I wasn't able to use the SSTable files in my new keyspace (created with the same name as the original one), so I used sstableloader tool to reinject my data into my newly created keyspace (with all the tables created):
$ sudo mv /var/lib/cassandra/data/mykeyspace /otherlocation/mykeyspace
$ sstableloader -d <host> -f /etc/cassandra/cassandra.yaml -v /otherlocation/mykeyspace/tablename-<token>;
