Migrating from Cassandra 2.1 to Scylla 4.5

What I have done so far:
nodetool flush (on cassandra)
nodetool snapshot name_snapshot (on cassandra)
cqlsh [IP] -e "DESC SCHEMA" > orig_schema.cql (on cassandra)
cqlsh [IP] --file 'orig_schema.cql' (on scylla)
while IFS= read -r d; do sstableloader -cph 14 -j 14 -nb -nx -d [IP] "${d}/"; done < <(find /dir_snapshot/keyspace/* -prune -type d)
But new data is still being written to Cassandra (it is impossible to stop the application from writing to it).
How can I transfer the incremental changes to Scylla?
I understand that the application can be configured to write to both Cassandra and Scylla, but by now the gap between the two datasets is four months.
The logical approach seems to be to transfer the changes that have accumulated over those four months and then redirect the application to Scylla.

Pretty much what you alluded to already - if you can't take the application down for maintenance, you will have to do the switch-over in the application.
The typical path here is to change the application code to write to two clusters at once (Cassandra & Scylla) while continuing to read from Cassandra.
While this is running, you can migrate the historical data, up to the point where the double-writing was deployed, into Scylla.
Once the Scylla cluster and the Cassandra cluster contain the same data, you change the application to read from and write to Scylla and decommission the Cassandra cluster.
Step 1: Write C & S -> Read C
Step 2: Migrate old Data C -> S
Step 3: Write S -> Read S
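At the CQL level, the dual-write phase is schematically just the following (an illustration only - the hosts, keyspace, and columns are placeholders, and in practice this logic lives inside the application, typically as two driver sessions):
cqlsh cassandra-host -e "INSERT INTO ks.tbl (id, val) VALUES (1, 'x');"
cqlsh scylla-host -e "INSERT INTO ks.tbl (id, val) VALUES (1, 'x');"
# reads keep going to Cassandra until the backfill has caught up
cqlsh cassandra-host -e "SELECT * FROM ks.tbl WHERE id = 1;"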

Related

Migrate data from one cassandra cluster to another

Hi, I want to migrate data from my Cassandra cluster to another Cassandra cluster. I have seen many posts suggesting various methods, but they are not very clear or have limitations. The methods I have seen are as follows:
Using COPY TO and COPY FROM commands: this is easy to use, but seems to have a limitation on the number of rows it can copy.
Using SSTABLELOADER: most articles suggest using sstableloader to move data from one cluster to another, but none give clear details on the steps. How are the sstables created - is there a nodetool command for this, or does a Java application have to be written? Are they created per node or per cluster? And how are they moved from one cluster to the other? Creating snapshots is tedious in that they are created per node and have to be transferred to the other cluster. I have also seen answers suggesting parallel ssh to snapshot the whole cluster, but did not find an example of that either.
Any help would be appreciated.
It's really a question that requires more information to provide a definitive answer. For example, do you need to keep metadata such as WriteTime and TTLs on the data, or not? Does the destination cluster have the same topology (number of nodes, token allocation, etc.)?
Basically, you have the following options:
Use sstableloader - a tool shipped with Cassandra itself that is used for restoring from backups, etc. To perform data migration you need to create a snapshot of the table (using nodetool snapshot) and run sstableloader on that snapshot (see the sketch after this list). The main advantage is that it keeps the metadata (TTL/WriteTime). The main disadvantage is that you need to take the snapshot and run the loader on every node of the source cluster, and you need exactly the same schema and partitioner in the destination cluster;
You can use a backup/restore tool, such as medusa, that basically automates taking the snapshot and loading the data;
You can use Apache Spark to copy data from one table to another with the Spark Cassandra Connector, for example as described in this blog post - just read the table from one cluster and write it to a table in another cluster. This works fine for simple copy operations, and you have the possibility to transform the data if necessary, but it becomes more complex if you need to preserve metadata. Plus it needs Spark;
Use DataStax Bulk Loader (DSBulk) to export data to files on disk and load them into another cluster. In contrast to cqlsh's COPY command, it is heavily optimized for loading/unloading large amounts of data. It works with Cassandra 2.1+ and most DSE versions (except ancient ones).
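For option 1, the per-node flow looks roughly like this (a sketch - keyspace, table, UUID, and IPs are placeholders, and note that sstableloader expects a <keyspace>/<table> directory layout, so the snapshot is staged first):
nodetool snapshot -t migration mykeyspace
mkdir -p /tmp/load/mykeyspace/mytable
cp /var/lib/cassandra/data/mykeyspace/mytable-<UUID>/snapshots/migration/* /tmp/load/mykeyspace/mytable/
sstableloader -d <destination-IP> /tmp/load/mykeyspace/mytable
Option 4 with DSBulk is a two-step unload/load along these lines:
dsbulk unload -h <source-IP> -k mykeyspace -t mytable -url /tmp/export
dsbulk load -h <destination-IP> -k mykeyspace -t mytable -url /tmp/export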
If you are able to set up the target cluster with exactly the same topology as the source cluster, the fastest way may be to simply copy the data files from the source to the target cluster, since this avoids the overhead of processing the data to redistribute it to different nodes. In order for this to work, your target cluster must have the same number of nodes, the same rack configuration, and even the same tokens assigned to each node.
To get the tokens for a source node, you can run nodetool info -T | grep Token | awk '{print $3}' | tr '\n' , | sed 's/,$/\n/'. You can then copy the comma-separated list of tokens from the output and paste it into the initial_token setting in your target node's cassandra.yaml. Once you start the node, check its tokens using nodetool info -T to verify that it has the correct tokens. Repeat these steps for each node in the target cluster.
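Concretely, the per-node procedure is (a sketch, with a placeholder for the token list):
# On the source node, print its tokens as one comma-separated line
nodetool info -T | grep Token | awk '{print $3}' | tr '\n' , | sed 's/,$/\n/'
# Paste the output into the matching target node's cassandra.yaml before its first start:
#   initial_token: <comma-separated tokens from the source node>
# After starting the target node, verify the tokens took effect:
nodetool info -T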
Once you have all of your target nodes set up with exactly the same tokens, DC, and racks as the source cluster, take a snapshot of the desired tables on the source cluster and copy the snapshots to the corresponding node's data directories on the target cluster. DataStax OpsCenter can automate the process of backing up and restoring data and will use direct copying for clusters with the same topology. It appears that medusa can do this too though I have not used this tool before.
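The copy-and-refresh step for one table then looks like this (a sketch with placeholder paths; the table directory, including its UUID, must match on the target):
# on each source node, for each table
nodetool snapshot -t migration mykeyspace
rsync -avP /var/lib/cassandra/data/mykeyspace/mytable-<UUID>/snapshots/migration/ \
      user@target-node:/var/lib/cassandra/data/mykeyspace/mytable-<UUID>/
# on the matching target node
nodetool refresh mykeyspace mytable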

Cassandra system.hints table is empty even when one of the nodes is down

I am learning Cassandra from academy.datastax.com. I am trying the Replication and Consistency demo on my local machine, with RF = 3 and Consistency = 1.
When my Node3 is down and I update my table using an UPDATE command, the system.hints table is expected to store a hint for node3, but it is always empty.
Do I need to make any configuration changes for hints to work, or are the defaults OK?
surjanrawat$ ccm node1 nodetool getendpoints mykeyspace mytable 1
127.0.0.3
127.0.0.4
127.0.0.5
surjanrawat$ ccm status
Cluster: 'mycluster'
--------------------
node1: UP
node3: DOWN
node2: UP
node5: UP
node4: UP
cqlsh:mykeyspace> select * from system.hints ;
target_id | hint_id | message_version | mutation
-----------+---------+-----------------+----------
(0 rows)
Did you use the exact same version of Cassandra to create the cluster? Since version 3+, hints are stored in the local filesystem of the coordinator. I ask because the exact same thing happened to me during that DataStax demo (I used 3.0.3 instead of 2.1.5). I replayed the steps with 2.1.5 and the hints were there in the table, just as expected.
I came across this post as I ran into the same issue. Recent versions of Cassandra don't store hints in the system.hints table. I am using Cassandra 3.9, and the hints are stored on the filesystem.
This is configured in the cassandra.yaml file:
# Directory where Cassandra should store hints.
# If not set, the default directory is $CASSANDRA_HOME/data/hints.
# hints_directory: /var/lib/cassandra/hints
In my case I was using ccm and I was able to find the hints at
${user_home}/.ccm/mycluster/node1/hints
where mycluster is my cluster name and node1 is my node name. Hope this helps someone.
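On a 3.x cluster you can confirm that hinted handoff is working by watching that directory while a node is down, e.g. (path as in the ccm layout above):
ls -lh ~/.ccm/mycluster/node1/hints/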
In order to understand why no hints are being written, you need a good grasp of how Cassandra replicates data.
A replication factor of 2 means two copies of each row, where each
copy is on a different node. All replicas are equally important; there
is no primary or master replica.
With a replication factor of 3 and 5 nodes in the cluster, you could lose another node and still not store hints, because your data replication strategy can still be satisfied. Try killing two more nodes and then check the hints table.
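With ccm that experiment looks like this (a sketch; the column and key in the UPDATE are hypothetical placeholders):
ccm node4 stop
ccm node5 stop
ccm status
# If every replica for the key is down, a CONSISTENCY ONE write fails outright
# (no hint); CONSISTENCY ANY lets the coordinator store just a hint instead.
cqlsh 127.0.0.1 -e "CONSISTENCY ANY; UPDATE mykeyspace.mytable SET value = 'x' WHERE id = 1;"
cqlsh 127.0.0.1 -e "SELECT * FROM system.hints;"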

Cassandra: Migrate keyspace data from a multi-node cluster to a single-node cluster

I have a keyspace in a multi-node cluster in a QA environment. I want to copy that keyspace to my local single-node cluster. Is there any direct way to do this? I can't afford to write code along the lines of an SSTableLoader implementation at this point. Please suggest the quickest way.
Make sure you have plenty of free disk space on your new node, and that the replication factor and consistency levels are properly set in your tests/build for your new single-node "cluster".
First, restore the exact schema from the old cluster to your new node. After that the data can be loaded in two ways:
1.) Execute the "sstableloader" utility on every node in your old cluster and point it at your new node. sstableloader is token aware, but in your case it will end up shipping all data to your new, single node cluster.
sstableloader -d NewNode /Path/To/OldCluster/SStables
2.) Snapshot the keyspace and copy the raw sstable files from the snapshot folders of each table in your old cluster to your new node. Once they're all there, copy the files to their corresponding table directory and run "nodetool refresh."
# Rinse and repeat for all tables
nodetool snapshot -t MySnapshot
cd /Data/keyspace/table-UUID/snapshots/MySnapshot/
rsync -avP ./*.db user@NewNode:/NewData/Keyspace/table-UUID
...
# when finished, exec the following for all tables in your new node
nodetool refresh keyspace table
Option #1 is probably best because it will stream the data and compact it naturally on the new node. It's also less manual work. Option #2 is a good, quick-and-dirty approach if you don't have a direct line from one cluster to the other. You probably won't notice much difference either way, since a QA keyspace is likely relatively small.
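As a footnote to the replication-factor point above: once the schema is restored on the single node, dropping the keyspace to RF 1 looks like this (a sketch; the keyspace name is a placeholder):
cqlsh NewNode -e "ALTER KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"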

Cassandra keyspace disappeared leading to data loss

I was adding a node (cassandra-03) to my Cassandra 2.1.8 cluster (2 existing nodes, cassandra-01 and cassandra-02, 160+GB each, 1 keyspace), following http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_node_to_cluster_t.html.
At stage #3 (after restarting each node), I realized that on my existing nodes (cassandra-01 and cassandra-02) my keyspace had disappeared, but the data is still on the filesystem.
nodetool status gives the expected output (a 3-node cluster), except in the load column, where I only see a few KB (I was expecting 160GB on cassandra-01 and cassandra-02).
I moved forward to step #4 and ran nodetool cleanup on cassandra-01. It finished in a few seconds, but my keyspace is still missing.
I re-created my keyspace via cqlsh, hoping cassandra will use the data sitting on the filesystem, with no luck.
Nothing weird on the logs, as far as I can tell.
How could I get my keyspace data back?
I wasn't able to reuse the SSTable files in my new keyspace (created with the same name as the original one), so I used the sstableloader tool to reinject the data into the newly created keyspace (with all the tables created):
$ sudo mv /var/lib/cassandra/data/mykeyspace /otherlocation/mykeyspace
$ sstableloader -d <host> -f /etc/cassandra/cassandra.yaml -v /otherlocation/mykeyspace/tablename-<token>
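The reason simply re-creating the keyspace did not pick up the old files is that each re-created table gets a directory with a new UUID. An alternative to sstableloader (a sketch, with placeholder UUIDs, in the spirit of the copy-and-refresh approach above) would be to copy the old files into the new table directory and refresh:
$ sudo cp /otherlocation/mykeyspace/tablename-<old-UUID>/*.db /var/lib/cassandra/data/mykeyspace/tablename-<new-UUID>/
$ nodetool refresh mykeyspace tablename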

Cassandra bulk insert solution

I have a Java program that runs as a service; it must insert 50k rows/s (each row has 25 columns) into a Cassandra cluster.
My cluster contains 3 nodes; each node has 4 CPU cores (Core i5, 2.4 GHz) and 4 GB of RAM.
I used the Hector API, multithreading, and bulk insert, but the performance is lower than expected (about 25k rows/s).
Can anyone suggest another solution? Does Cassandra support an internal bulk insert (without using Thrift)?
Astyanax is a high-level Java client for Apache Cassandra, a highly available, column-oriented database.
Astyanax is currently in use at Netflix. Issues are generally fixed as quickly as possible and releases are done frequently.
https://github.com/Netflix/astyanax
I've had good luck creating sstables and loading them directly. There is an sstableloader tool included in the distribution, as well as a JMX interface. You can create the sstables using the SSTableSimpleUnsortedWriter class. Details here.
The fastest way to bulk-insert data into Cassandra is sstableloader, a utility provided with Cassandra from 0.8 onwards. For that you have to create the sstables first, which is possible with SSTableSimpleUnsortedWriter; more about this is described here.
Another fast way is Cassandra's BulkOutputFormat for Hadoop. With this you can write a Hadoop job to load data into Cassandra; see more on bulk loading to Cassandra with Hadoop.
