How to force the Commit log's commit to memtable? - cassandra

I'm using the sstableloader's command to load the backup and I note that the new datas not load. Analising the files, the new datas still in the commit log and not in the backup. How could I force the "commit" the commit log datas in the memtables?

So normally, I'd say try running a nodetool flush. If data is getting flushed from memory to disk, it should remove the corresponding commitlog file.
If that's not happening, the next thing I would try is stopping the node with nodetool drain and then stopping the Cassandra process. When you restart it, the node will process the commitlogs and reconcile them with data stored on disk.

Related

Cassandra write semantics

In Cassandra architecture, when we perform a write operation, data is first written in commit log, then into memtable, and when memtable reaches threshold, data is flushed into SSTable.
So at a given time we have 2 copies of data in a given node: one copy is in commit log and another copt is either in memtable or flushed to SSTable.
So why do we need to have 2 copies? Isn't commit log enough for recovery purposes? Or do they serve totally different purposes? And how are these 3 different from each other?
When you write, Cassandra saves the data to both commit log and Memtable, that makes the operation very fast. If the node restarts before the data is saved to the persistent SSTable, the data in memory is lost, but can be recovered from the commit log.
So Cassandra uses Memtables and SStables for lookup, and commit logs allows restarting a node at any moment without losing the data.

Could not read commit log descriptor in file

I started to use cassandra 3.7 and always I have problems with the commitlog. When the pc unexpected finished by a power outage for example the cassandra service doesn't restart. I try to start for the command line, but always the error cassandra could not read commit log descriptor in file appears.
I have to delete all the commit logs to start the cassandra service. The problem is that I lose a lot of data. I tried to increment the replication factor to 3, but is the same.
What I can do to decrease amount of lost data?
pd: I only one pc to use cassandra database, it is not possible to add more pcs.
I think your option here is to work around the issue since its unlikely there is a guaranteed solution to prevent commit table files getting corrupted on sudden power outage. Since you only have a single node, it makes it more difficult to recover the data. Increasing the replication factor to 3 on a single node cluster is not going to help.
One thing you can try is to reduce the frequency at which the memtables are flushed. On flush of memtable the entries in the commit log are discarded, therefore reducing the amount of data lost. Details here. This will however not resolve the root issue

How to modify the memtable flush time interval in cassandra?

I have enabled the incremental backup in the cassandra.yaml file. As I know when we enable incremental backups, cassandra will backup the data (in backups directory) only when the memtable is flushed. But what if the memtable is yet to be flushed? I won't be able to get the incremental backup right?. I know that for the memtable to be flushed there are certain conditions to be met such as time interval or memtable space. My question is how do I modify this so that even if I enter one record after the last snapshot, I can still backup entire data along with that latest entry?
Consider this example
Take the snapshot.
Clear incremental backup (backups directory)
Enter a record to a table.
Check for the incremental backup in backups directory. It is still empty.
Now how do I backup the record which is written after the last snapshot?In general how do we backup the entire upto-date data unless we take the snapshot?
You can flush the files manually with nodetool flush just before taking the backup. That way you'll always have the latest memtable flushed.
nodetool docs
If you want to backup a cluster without taking a snapshot you can do it by simply saving everything under /data folder from every node (this includes mainly the .db files stats files etc).
In order to not override files you should store it with the token information as well.
When you want to restore from this backup, you should spin up a cluster with the same number of nodes, and simply copy the data, one-to-one from each backed-up node to a restored node. Pay attention that you'll have to modify cassandra.yaml to include the relevant token in cassandra.yaml (as well as the peers/seeds/etc) for each restored node.
After all the data is copied, you can start C* process on all the nodes.

Many commitlog files with Cassandra

I've got a mostly idle (till now) 4 node cassandra cluster that's hooked up with opscenter. There's a table that's had very little writes in the last week or so (test cluster). It's running 2.1.0. Happened to ssh in, and out of curiosity, ran du -sh * on the data directory. Here's what I get:
4.2G commitlog
851M data
188K saved_caches
There's 136 files in the commit log directory. I flushed, and then drained cassandra, stopped and started the service. Those files are still there. What's the best way to get rid of these? Most of the stuff is opscenter related, and I'm inclined to just blow them away as I don't need the test data. Wondering what to do in case this pops up again. Appreciate any tips.
The files in the commit log directory have a fixed size determined by your settings in the cassandra.yaml. All files have a pre-allocated size, so you cannot change it by making flush, drain or other operations on the cluster.
You have to change the configuration if you want to reduce their size.
Look at the configuration settings "commitlog_total_space_in_mb" and "commitlog_segment_size_in_mb" to configure the size of each file and the total space occupied by all of them.

Proper cassandra keyspace restore procedure

I am looking for confirmation that my Cassandra backup and restore procedures are sound and I am not missing anything. Can you please confirm, or tell me if something is incorrect/missing?
Backups:
I run daily full backups of the keyspaces I care about, via "nodetool snapshot keyspace_name -t current_timestamp". After the snapshot has been taken, I copy the data to a mounted disk, dedicated to backups, then do a "nodetool clearsnapshot $keyspace_name -t $current_timestamp"
I also run hourly incremental backups - executing a "nodetool flush keyspace_name" and then moving files from the backup directory of each keyspace, into the backup mountpoint
Restore:
So far, the only valid way I have found to do a restore (and tested/confirmed) is to do this, on ALL Cassandra nodes in the cluster:
Stop Cassandra
Clear the commitlog *.log files
Clear the *.db files from the table I want to restore
Copy the snapshot/full backup files into that directory
Copy any incremental files I need to (I have not tested with multiple incrementals, but I am assuming I will have to overlay the files, in sequence from oldest to newest)
Start Cassandra
On one of the nodes, run a "nodetool repair keyspace_name"
So my questions are:
Does the above backup and restore strategy seem valid? Are any steps inaccurate or anything missing?
Is there a way to do this without stopping Cassandra on EVERY node? For example, is there a way to restore the data on ONE node, then somehow make it "authoritative"? I tried this, and, as expected, since the restored data is older, the data on the other nodes (which is newer) overwrites in when they sync up during repair.
Thank you!
There's two ways to restore Cassandra backups without restarting C*:
Copy the files into place, then run "nodetool refresh". This has the caveat that the rows will still be older than tombstones. So if you're trying to restore deleted data, it won't do what you want. It also only applies to the local server (you'll want to repair after)
Use "sstableloader". This will load data to all nodes. You'll need to make sure you have the sstables from a complete replica, which may mean loading the sstables from multiple nodes. Added bonus, this works even if the cluster size has changed. I'm not sure if ordering matters here (that is, I don't know if row timestamps are preserved through the load or if they're redefined during load)

Resources