Cassandra data recovery from commitlog file

I have the following Cassandra question:
A few days ago I developed an application using C# and a single-node Cassandra DB. While the application was in production, a power failure occurred and the Cassandra commitlog got corrupted. Because of this the Cassandra node would not start, so I moved all the commitlog files to another directory and started the node.
Recently I noticed that the data from the day of the power failure is not available in the database. I still have all the commitlog files, including the corrupted one.
Can you please suggest whether there is a way to recover the data from these commitlog files?
Also, how can commitlog corruption be avoided, so that data loss in production can be prevented?
Thank you.

There is no way to restore the node to its previous state if your commit logs are corrupted and you have no SSTables.
If your commit logs are healthy (i.e. not corrupted), then you just need to restart your node. They will be replayed and, as a result, will rebuild the memtable(s) and flush generation-1 SSTables to disk.
What you can ideally do is forcibly create SSTables.
You can do that from the apache-cassandra/bin directory with
nodetool flush
So if you are wary of losing commit logs, you can rebuild your node to a previous state from the SSTables created above using
nodetool.bat refresh [keyspace] [columnfamily]
Alternatively, you can also try creating snapshots with
nodetool snapshot
This command takes a snapshot of all keyspaces on the node. You also have the option of enabling incremental backups, but these only keep a record of the latest operations.
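A rough sketch of that workflow, assuming a keyspace named my_keyspace and a table named my_table (placeholder names, not taken from the question):
# flush memtables to generation-1 SSTables on disk
nodetool flush my_keyspace
# take a named snapshot of all keyspaces on the node
nodetool snapshot -t before_restart
# after copying the SSTable files back into the table's data directory, load them into the running node
nodetool refresh my_keyspace my_table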
For more information, see
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsNodetool_r.html
I also suggest adding more nodes and increasing the replication factor to avoid such scenarios in the future.
Hope it helps!

Related

From where cassandra fetch data for select?

I understand the concept of SSTables in Cassandra. I have also tested the different versions of files created by inserts after nodetool flush.
I have also set up snapshot backups and incremental backups and tested that they work fine.
For testing purposes I deleted all the SSTable files from all the nodes. Strangely, I am still able to select the data.
Can someone please explain where Cassandra is fetching the data from?
Regards
Sid
The records you queried were still available in memory, i.e. in the memtables (and possibly the row cache).
So even after you restart the node you will, perhaps surprisingly, still get the results back, because the commit logs are replayed and eventually rebuild those SSTables for you.
Clear all the SSTables and the commit logs and restart the node; then you will observe that your query returns no records.
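A minimal sketch of that check, assuming a keyspace my_keyspace with a table my_table and the default directory layout (all placeholder names and paths):
# flush memtables so the data is written to SSTables, then drain so the commit log can be safely cleared
nodetool flush my_keyspace my_table
nodetool drain
# remove the SSTables and commit logs (paths depend on data_file_directories and commitlog_directory in cassandra.yaml)
rm -rf /var/lib/cassandra/data/my_keyspace/my_table-*/* /var/lib/cassandra/commitlog/*
# after restarting the node, the same SELECT should return no rows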

Cassandra: Unable to read keyspace from data directory

I have a single-node Cassandra setup for my application. To reclaim disk space occupied by deleted records (tombstoned records), I triggered a nodetool compact for my keyspace. Unfortunately, this compaction process got interrupted. Now, when I try to re-start the service, it does not recognise the keyspace (from the data directory configured in cassandra.yaml) for which compaction was in progress when it got interrupted. Other keyspaces like system and system_traces are successfully initiated from the same data directory.
Has anybody encountered a similar issue before? Also, pointers to restore a keyspace only from data files would be of great help (for the lack of maintenance of snapshots).
PS: Upon analysing further it was found that an rm command on the cassandra data directory was issued but immediately cancelled. Most of the data seems to be in place, but there is a chance that the Data.db file of the system keyspace was lost. Is there a way to recover from this state?
It seems like your setup got corrupted when the system keyspace files were deleted, which is why Cassandra no longer recognises your keyspace at boot time.
Try this:
Download the same version of Cassandra again and set it up fresh.
Recreate your keyspace and column family schemas.
Move whatever old data is left into the new data directory (Cassandra will only load the non-corrupted data):
sudo mv /data/cassandra_old/data/[keyspace]/[cf]-[md5-old]/* /data/cassandra_new/data/[keyspace]/[cf]-[md5-new]/
It should solve it if I understand the problem correctly.
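A sketch of those steps on the command line, with keyspace, table, and directory names as placeholders (the table directory suffixes differ between installations):
# recreate the schema on the fresh installation (adjust the table definition and replication to match your original setup)
cqlsh -e "CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
cqlsh -e "CREATE TABLE my_keyspace.my_table (id uuid PRIMARY KEY, value text);"
# move the surviving SSTable files into the new table directory
sudo mv /data/cassandra_old/data/my_keyspace/my_table-<old-id>/* /data/cassandra_new/data/my_keyspace/my_table-<new-id>/
# restart the node (however your installation manages the service) so the moved files are picked up
sudo service cassandra restart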

Backups folder in Opscenter keyspace growing really huge

We have a 10-node Cassandra cluster. We configured a repair in OpsCenter. We find that a backups folder is created for every table in the OpsCenter keyspace, and it keeps growing huge. Is there a solution to this, or do we have to manually delete the data in each backups folder?
First off, Backups are different from snapshots - you can take a look at the backup documentation for OpsCenter to learn more.
Incremental backups:
From the datastax docs -
When incremental backups are enabled (disabled by default), Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. This allows storing backups offsite without transferring entire snapshots. Also, incremental backups combine with snapshots to provide a dependable, up-to-date backup mechanism.
...
As with snapshots, Cassandra does not automatically clear incremental backup files. DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created.
You must have turned on incremental backups by setting incremental_backups to true in cassandra.yaml.
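For reference, the relevant setting in cassandra.yaml looks like this (set it back to false if you do not want the backups directories to accumulate):
# cassandra.yaml
incremental_backups: true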
If you are interested in a backup strategy, I recommend you use the OpsCenter Backup Service instead. That way, you're able to control granularly which keyspace you want to back up and push your files to S3.
Snapshots
Snapshots are hard-links to old (no longer used) SSTables. Snapshots protect you from yourself: for example, if you accidentally truncate the wrong keyspace, you'll still have a snapshot of that table that you can bring back. If you end up with too many snapshots, there are a couple of things you can do:
Don't run Sync repairs
This is related to repairs because synchronous repairs generate a snapshot each time they run. To avoid this, run parallel repairs instead (with the -par flag, or by configuring the repair service in the OpsCenter config file; see the note below).
Clear your snapshots
If you have too many snapshots and need to free up space (perhaps once you have backed them up to S3 or Glacier or something), go ahead and use nodetool clearsnapshot to delete them. This will free up space. You can also remove them manually from your file system, but nodetool clearsnapshot removes the risk of rm -rf'ing the wrong thing.
Note: You may also be running repairs too frequently if you don't have a ton of data (check my response to this other SO question for an explanation and the repair service config levers).
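A small sketch of the cleanup, with the keyspace name as a placeholder:
# list existing snapshots and the space they use (available in Cassandra 2.1+)
nodetool listsnapshots
# remove all snapshots for one keyspace (omit the keyspace to clear snapshots for all keyspaces)
nodetool clearsnapshot my_keyspace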

How to restore Cassandra snapshots into a smaller cluster

If I take snapshots on every node of a 10-node cluster, how do I restore them into a 5-node cluster where every node has a stronger CPU and more storage?
The traditional way to restore sstable backups is by copying the sstable files to the data directory and calling 'refresh' to load the data into the running cluster. If the topology has changed, if you're unable to access the data directory, if you have filename collisions, or if you don't have sufficient room or time to deal with lots of nodes having a ton of data they don't own, then nodetool refresh may be less than ideal.
However, Cassandra includes a bundled tool called 'sstableloader', which reads SSTables from disk and writes them into a running cluster. sstableloader may be a good fit to load the data from your SSTables into the new cluster without worrying about the changed topology.
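A minimal sstableloader invocation might look like this (hosts and paths are placeholders; the last two path components must be the keyspace and table names):
# stream one table's snapshot SSTables into the new 5-node cluster
sstableloader -d 10.0.0.1,10.0.0.2 /path/to/snapshots/my_keyspace/my_table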
More info is available: http://www.pythian.com/blog/bulk-loading-options-for-cassandra/

Data Storage Issue in cassandra

I am facing a problem: Cassandra is not storing the data, although the commit log is working, as far as I can tell from the configuration in the .yaml file. I have checked the data folder and Cassandra has created a folder for MYKEYSPACENAME, but no data is being stored there.
Is there something I need to do to get the data stored? I am using Cassandra version 1.0.7.
It sounds like everything is working normally. When Cassandra receives data to write, it first writes it to a commit log and to an in-memory data structure (called a Memtable). Once the Memtable is full, Cassandra will flush it to an SSTable on disk. You can force Cassandra to flush its Memtables using nodetool:
nodetool flush [keyspace] [cfnames]
This is not something that you need to do for normal operation of a Cassandra ring. Cassandra will eventually flush the Memtables to disk. If for some reason one of your Cassandra machines goes down, when it restarts it will replay the commit log so you will not lose any previously received writes.
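For example, assuming a keyspace named my_keyspace and a column family named users (placeholder names):
# flush just that column family's memtable to an SSTable on disk
nodetool flush my_keyspace users
# the new SSTable files should then appear under the my_keyspace data directory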
The Cassandra wiki has more information on Memtables and SSTables.
