As the data in case of Cassandra is physically removed during compaction, is it possible to access the recently deleted data in any way? I'm looking for something similar to Oracle Flashback feature (AS OF TIMESTAMP).
Also, I can see the pieces of deleted data in the relevant commit log file, however it's obviously unreadable. Is it possible to convert this file to a more readable format?
You will want to execute a restore from your commitlog.
The safest is to copy the commitlog to a new cluster (with same schema), and restore following the instructions (comments) from commitlog_archiving.properties file. In your case, you will want to set restore_point_in_time to a time between your insert and your delete.
Related
We deleted some old data within our 3 node cassandra cluster (v3.11) some days ago which shall now be restored from a Snapshot. Is there a possibility to restore the data from the snapshot without loosing updates made since the snapshot was taken?
There are two approaches which come to my mind
A)
Create export via COPY keypsace.table TO xy.csv
Truncate table
restore table from snapshot via sstableloader
Reimport newer data via COPY keypsace.table FROM xy.csv
B)
Just copy sstable files of snapshot into current table directory
Is A) a feasible option? What do we need to consider so that the COPY FROM/TO commands get synchronized over all nodes?
For option B) I read that the deletion commands that happend may be executed again (tombstone rows). Can I ignore this warning if we make sure the deletion commands happened more than 10 days ago (gc_grace_seconds)?
for exporting/importing data from Apache Cassandra®, there is an efficient tool -- DataStax Bulk Loader (aka DSBulk). You could refer to more documentation and examples here. For getting consistent reads and writes, you could leverage --datastax-java-driver.basic.request.consistency LOCAL_QUORUM in your unload & load commands.
Theoretical question:
Lets say I have a cassandra cluster with some data in it.
Backups are created on a daily basis.
Now a subset of data is being lost, either by application error or manual deletion.
What is the best way to restore data from existing backup?
I can think of starting a separate node with the backup disk attached, then export data manually through selects and reimport into the prod database.
That would work but sounds complicated, is there a more straight forward solution for such problems?
If its a single partition probably best bet is to use sstabledump or something like sstable-tools to read from it and just manually reinstert. If ok with restoring everything deleted from time of snapshot: reduce gcgrace to purge any tombstones with a force compact (or else they will continue to shadow the restored data) and use the sstable loader or if the token ranges are the same copy the backed up sstables back in the data directory.
If we have deleted some(20 query) data in Cassandra using below delete query.
DELETE lastname FROM cycling.cyclist_name WHERE id = c7fceba0-c141-4207-9494-a29f9809de6f;
So how we can restore/find above deleted data in Cassandra? please help
If no compaction happened yet, then you may recover the data from SSTables via sstabledump and get them from generated JSON files.
But correct answer is to use some kind of backup solution - via OpsCenter, or using the manual backup via nodetool snapshot, etc. More information you can find in following article of DataStax support team.
Cassandra doesn't delete data immediately. As Alex hinted, it will still be in the sstables (data files) until compaction, and only marked with a deletion flag (tombstoned).
You can dump the contents of the sstables into text files and then search for your id.
Do something like this for each sstable:
sstabledump mc-3-big-Data.db > dump2019a
These text files will have your data, with a "deletion_info" flag. You can then search for your id and retrieve the data.
You should act quickly before compaction, though.
Looking in my keyspace directory I see several versions of most of my tables. I am assuming this is because I dropped them at some point and recreated them as I was refining the schema.
table1-b3441432142142sdf02328914104803190
table1-ba234143018dssd810412asdfsf2498041
These created tables names are very cumbersome to work with. Try changing to one of the directories without copy pasting the directory name from the terminal window... Painful. So easy to mistype something.
That side note aside, how do I tell which directory is the most current version of the table? Can I automatically delete the old versions? I am not clear if these are considered snapshots or not since each directory also can contain snapshots. I read in another post you can stop autosnapshot, but I'm not sure I want that. I'd rather just automatically delete any tables not being currently used (i.e.: that are not the latest version).
I stumbled across this trying to do a backup. I realized I am forced go to every table directory and copy out the snapshot files (there are like 50 directories..not including all the old table versions) which seems like a terrible design (maybe I'm missing something??).
I assumed I could do a snapshot of the whole keyspace and get one file back or at least output all the files to a single directory that represents the snapshot of the entire keyspace. At the very least it would be nice knowing what the current versions are so I can grab the correct files and offload them to storage somewhere.
DataStax Enterprise has a backup feature but it only supports AWS and I am using Azure.
So to clarify:
How do I automatically delete old table versions and know which is
the current version?
How can I backup the most recent versions of the tables and output the files to a single directory that I can offload somewhere? I only have two nodes, so simply relying on the repair is not a good option for me if a node goes down.
You can see the active version of a table by looking in the system keyspace and checking the cf_id field. For example, to see the version for a table in the 'test' keyspace with table name 'temp', you could do this:
cqlsh> SELECT cf_id FROM system.schema_columnfamilies WHERE keyspace_name='test' AND columnfamily_name='temp' allow filtering;
cf_id
--------------------------------------
d8ea9830-20e9-11e5-afc0-c381f961c62a
As far as I know, it is safe to delete (rm -r) outdated table version directories that are no longer active. I imagine they don't delete them automatically so that you can recover the data if you dropped them by mistake. I don't know of a way to have them removed automatically even if auto snapshot is disabled.
I don't think there is a command to write all the snapshot files to a single directory. According to the documentation on snapshot, "After the snapshot is complete, you can move the backup files to another location if needed, or you can leave them in place." So it's left up to the application developer how they want to handle archiving the snapshot files.
I have enabled the incremental backup in the cassandra.yaml file. As I know when we enable incremental backups, cassandra will backup the data (in backups directory) only when the memtable is flushed. But what if the memtable is yet to be flushed? I won't be able to get the incremental backup right?. I know that for the memtable to be flushed there are certain conditions to be met such as time interval or memtable space. My question is how do I modify this so that even if I enter one record after the last snapshot, I can still backup entire data along with that latest entry?
Consider this example
Take the snapshot.
Clear incremental backup (backups directory)
Enter a record to a table.
Check for the incremental backup in backups directory. It is still empty.
Now how do I backup the record which is written after the last snapshot?In general how do we backup the entire upto-date data unless we take the snapshot?
You can flush the files manually with nodetool flush just before taking the backup. That way you'll always have the latest memtable flushed.
nodetool docs
If you want to backup a cluster without taking a snapshot you can do it by simply saving everything under /data folder from every node (this includes mainly the .db files stats files etc).
In order to not override files you should store it with the token information as well.
When you want to restore from this backup, you should spin up a cluster with the same number of nodes, and simply copy the data, one-to-one from each backed-up node to a restored node. Pay attention that you'll have to modify cassandra.yaml to include the relevant token in cassandra.yaml (as well as the peers/seeds/etc) for each restored node.
After all the data is copied, you can start C* process on all the nodes.