How to validate the restoration is successful - cassandra

I have taken a full snapshot from a node. I copied the snapshot directory into the /var/lib/cassandra/data/Keyspace/Tables/ directory on the restoration node. I tried both restarting the service and running the nodetool refresh command to load the data on the new node. It worked like a charm.
However, I am unable to count the records in tables that hold a large number of rows: the query fails with a Connection timed out error. So I cannot validate that all of the data in those tables has been restored successfully.
I also checked the space occupied by the keyspace using nodetool cfstats -H and nodetool tablestats -H, and the "Space used" value matches exactly.
I use the command below to get the total count for a specific table.
select count(*) from milestone LIMIT 100000;
My Question:
What if a few records went missing during restoration? What if the counts in the backup and in the restored data differ and I have no way of knowing it? Could you please suggest a way to validate that the restoration was successful?
How can I ensure that all of the records were copied successfully?

Usually, to validate a data restoration, you can take a CSV export of your data sets before the restoration and another CSV export after it. Then compare the two exports to see whether anything is missing.
To compare the two CSV files:
# diff mytable_old.csv mytable_new.csv
To learn more about cqlsh COPY for CSV exports: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html
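A plain diff reports differences whenever the two exports list the same rows in a different order, which may happen when they come from two different nodes. A minimal Python sketch of an order-independent comparison follows; the function names are made up for illustration, and it assumes both files were exported with the same column list and fit in memory.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV export into a multiset of rows, so row order does not matter."""
    with open(path, newline="") as f:
        return Counter(tuple(row) for row in csv.reader(f))


def compare_exports(old_path, new_path):
    """Return (missing, extra) as Counters.

    missing: rows present in the old export but not in the new one.
    extra:   rows present in the new export but not in the old one.
    """
    old_rows = load_rows(old_path)
    new_rows = load_rows(new_path)
    return old_rows - new_rows, new_rows - old_rows
```

Calling compare_exports("mytable_old.csv", "mytable_new.csv") and checking that both returned counters are empty gives the same answer as a clean diff, without depending on row order. For tables too large to hold in memory, the same comparison would have to be done in slices.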

Depending on your dataset size, it might not be possible (or reasonable) to compare the full dataset.
In that case, you can take a random approach and query only a percentage of the dataset.
If you do want to query the full dataset, the best approach is to query all partitions one by one, by token, and compare the results with the original dataset. You can look at https://github.com/ckalantzis/cassTickler for an example of how to query the full dataset. Its objective is different, but the approach I'm recommending is the same.
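To illustrate the by-token approach, here is a rough Python sketch that splits the token ring into slices and builds one count query per slice, so that each query is small enough to avoid the timeout. It assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63-1); the helper names and the partition key column "id" are hypothetical, and the queries are only generated here, not executed.

```python
MIN_TOKEN = -(2 ** 63)      # smallest Murmur3Partitioner token (assumed partitioner)
MAX_TOKEN = 2 ** 63 - 1     # largest Murmur3Partitioner token


def split_token_ring(splits):
    """Split the full token range into `splits` contiguous (start, end) slices."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + total * i // splits for i in range(splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def count_queries(table, partition_key, splits):
    """Build one CQL count query per slice of the token ring."""
    queries = []
    for i, (start, end) in enumerate(split_token_ring(splits)):
        # The first slice includes its lower bound so the minimum token is not skipped;
        # every later slice excludes it, so no token is counted twice.
        lower = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT count(*) FROM {table} "
            f"WHERE token({partition_key}) {lower} {start} "
            f"AND token({partition_key}) <= {end};"
        )
    return queries
```

Running the queries from count_queries("milestone", "id", 1000) against both the source and the restored node, then comparing the per-slice counts, also shows which token range differs if the totals do not match.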

Related

Cassandra - Truncate a table while inserts in progress

I want to understand how the truncate command works in Cassandra (version 3.9) to be able to know what would happen in the following scenario:
I have about 100GB of data in a production table that needs to be truncated.
I want to truncate this table, but at the same time there will be a few hundred requests per second that will be making inserts at the same time.
I am trying to understand, theoretically how would this play out.
Would the truncate try to acquire some sort of lock on the table before it can proceed, and possibly block the insert requests or time out itself?
Or would the truncate be processed in sequence as the request comes in, with the following insert requests creating additional rows, so that I would end up with a small number of rows remaining after the truncate?
I am just trying to reclaim space, so I am not particularly concerned if a small amount of data remains from the insert requests run after the truncate command.
I am just trying to understand if you'd expect this to complete successfully or it would fail / time-out.
I will try to run a similar scenario on a smaller cluster, but I'm not sure if that will be a good substitute to understand the actual behavior. Any inputs will be helpful.
Truncate sends a message to all the nodes with a request to delete all the SSTables that exist at the moment of execution. Afterwards, the table will contain only the upserts received after the truncate was issued.
The DataStax documentation states that this is done with JMX, but according to the comments on this answer, it is done with CQL and the messaging service.
If you are trying to reclaim disk space, please note that a snapshot will be created with the truncate if auto_snapshot is set to true (true is the default value), so you will need to remove the snapshot after the command has run. Also, note that truncate requires all the nodes to be up and healthy in order to complete.
I tried this myself. On a 2-node Cassandra cluster I made inserts at about 160 requests per second in the background and ran a truncate query on the same table, which had about 200,000 records.
The table got truncated and the inserts continued without an error.
The new rows inserted after the truncate showed on the DB.

Is there a way to view data in 2 replicas in Cassandra?

I am a newbie to Cassandra. I have created a keyspace in Cassandra using NetworkTopologyStrategy with 2 replicas in one datacenter. Is there a CQL command or some other way to view my data in the two replicas?
Like SELECT * FROM tablename in replica1 / replica2
Or is there another way to visually see the data in the two replicas?
Thanks in advance.
Your question, "see the data in 2 replicas", is not entirely clear. If you ever want to validate your data, you can run some commands to visually inspect things.
The first thing you'd want to do is log onto the node you want to investigate. Go to the data directory of the table you are interested in -> DataDir/keyspace/table. In there you'll see one or more files that look like *Data.db. Those are your sstables. Data in memory is flushed to sstables in certain scenarios. You want to be sure your data is flushed from memory to disk if you're validating (as you may not find what you're looking for otherwise). To do that, you issue a "nodetool flush" command (you can pass the keyspace and table as parameters if you only want to flush the specific table).
Like I said, after that, everything in memory will have been flushed to disk, so you'll be able to see your sstables (again, the *Data.db files). Once you have those sstables, you can run the "sstabledump" command on each sstable to see the data that resides in it, thus validating your data.
If you have only a few rows you want to validate and a lot of nodes, you can find which nodes the rows reside on by running "nodetool getendpoints" with the keyspace, table, and partition key. That will tell you every node that should have the data, so you're not guessing which node the row(s) should be on. Unfortunately, there is no way to know which sstable the rows should exist in (and it could be more than one if updates, deletes, etc. occurred). You'll have to go through each sstable on the specific node(s).
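Reading raw sstabledump output by eye gets tedious with more than a handful of partitions. The Python sketch below summarizes a dump as partition keys with row counts, so that the output from two replicas can be compared. It assumes the dump is the JSON array that sstabledump prints by default, with a "partition"/"key" entry and a "rows" list per partition; the function name is made up, and the field layout should be checked against your Cassandra version.

```python
import json


def summarize_dump(dump_text):
    """Map each partition key (as a tuple) to its number of entries of type "row".

    dump_text is assumed to be the JSON text printed by sstabledump for one sstable.
    """
    summary = {}
    for partition in json.loads(dump_text):
        key = tuple(partition["partition"]["key"])
        # Skip non-row entries such as range tombstone markers.
        rows = [r for r in partition.get("rows", []) if r.get("type") == "row"]
        summary[key] = summary.get(key, 0) + len(rows)
    return summary
```

Since a partition can be spread over several sstables, the summaries of all sstables on a node would need to be merged before comparing one replica with the other.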
Hope that helps answer your question?
Good luck.
-Jim
You can for a specific partition. If you are sure host1 is a replica (nodetool getendpoints or from query trace), then if you make your query with CL.ONE and explicitly to that host, the coordinator will always pick local first. So
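Statement q = new SimpleStatement("SELECT * FROM tablename WHERE key = X");
q.setConsistencyLevel(ConsistencyLevel.ONE);
// In the 3.x Java driver setHost takes a Host object (for example one taken from
// cluster.getMetadata().getAllHosts()), not a string; host1 stands for that object here.
q.setHost(host1);
Where host1 owns X.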
For SELECT * FROM tablename it's a bit harder, because you are scanning the entire data set and the coordinator will send out multiple queries, one for each part of the ring. If you run the query with CL.ONE it will still only go to one node for each part of the range, so if you set q.enableTracing() you can see which node answered for each range. You have no control over which replica the coordinator picks, so it may take a few queries.
If you just want to see whether there are differences, you can use preview repair: nodetool repair --preview --full.

Cassandra get backup of expired data

I have a very large amount of data in one of my Cassandra tables. I want to keep only the last two months of data in the table, so I set a TTL of two months on every row. But now I want to keep the expired data as a backup for later use. Please suggest what I should do to take this backup.
Data in Cassandra is stored as files on disk. You could just copy those files off your production machines onto whatever storage medium you like, and restore them later. You can follow the link below to see how you would do this:
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_restore_c.html

Drop table or truncate table in Cassandra, which is better

We have a use case where we need to re-create a table every day with current data in Cassandra. For this, should we use drop table or truncate table? Which would be more efficient? We do not need the data to be backed up.
Thanks
Ankur
I think that in almost all cases Truncate is a safer operation than a drop and recreate. There have been several issues with dropping and recreating in the past, such as ghost data and schema disagreement. Although there have been a number of fixes to make drop/recreate more stable, if it's an operation you are performing every day, Truncate should be much cheaper and more stable.
Drop table drops the table and all data. Truncate clears all data in the table, and by default creates a snapshot of the data (but not the schema). Efficiency-wise, they're close, though truncate will create the snapshot. You can disable this by setting auto_snapshot to false in cassandra.yaml, but the setting is server-wide. If it's not too much trouble, I'd drop and recreate the table, but I've seen issues if you don't wait a while after the drop before recreating.
Source : https://support.datastax.com/hc/en-us/articles/204226339-FAQ-How-to-drop-and-recreate-a-table-in-Cassandra-versions-older-than-2-1
NOTE: By default, snapshots are created when tables are dropped or truncated. This will need to be cleaned out manually to reclaim disk space.
Tested manually as well.
Truncate will keep the schema though, drop will not.
Beware!
From datastax documentation: https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cqlTruncate.html
Note: TRUNCATE sends a JMX command to all nodes, telling them to delete SSTables that hold the data from the specified table. If any of these nodes is down or doesn't respond, the command fails and outputs a message like the following:
truncate cycling.user_activity;
Unable to complete request: one or more nodes were unavailable.
Unfortunately, there is nothing in the documentation that says whether DROP behaves differently.

Cassandra Database Problem

I am using Cassandra for a large-scale application, and I am new to it. I have a database schema for a particular keyspace, for which I created the columns using the Cassandra Command Line Interface (CLI). When I copied a dataset into the folder /var/lib/cassandra/data/, I was not able to access the values using the key of a particular column. I get a message saying that zero rows are present, but the files are there. They have names of the form XXXX-Data.db, XXXX-Filter.db, XXXX-Index.db. Can anyone tell me how to access the columns of existing datasets?
(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart if you do any manual surgery like that.
(b) if you didn't also copy the schema definition it will ignore data files for unknown column families.
For what you are trying to achieve, it would probably be better to export and import your SSTables.
You should have a look at bin/sstable2json and bin/json2sstable.
Documentation is there (near the end of the page): Cassandra Operations
