Is it possible to recover deleted column data in Cassandra?

We have deleted some data (about 20 queries' worth) in Cassandra using the delete query below.
DELETE lastname FROM cycling.cyclist_name WHERE id = c7fceba0-c141-4207-9494-a29f9809de6f;
How can we restore/find the deleted data in Cassandra? Please help.

If no compaction has happened yet, you may be able to recover the data from the SSTables via sstabledump and extract it from the generated JSON files.
But the correct answer is to use some kind of backup solution: via OpsCenter, or a manual backup via nodetool snapshot, etc. You can find more information in the DataStax support team's article on this topic.
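For example, a manual snapshot of the cycling keyspace from the question could be taken like this before running destructive queries (the snapshot tag is just an illustrative name):
nodetool snapshot -t before_delete cycling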

Cassandra doesn't delete data immediately. As Alex hinted, it will still be in the sstables (data files) until compaction, and only marked with a deletion flag (tombstoned).
You can dump the contents of the sstables into text files and then search for your id.
Do something like this for each sstable:
sstabledump mc-3-big-Data.db > dump2019a
These text files will have your data, with a "deletion_info" flag. You can then search for your id and retrieve the data.
You should act quickly before compaction, though.
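As a rough sketch, assuming the table's data directory and reusing the id from the question, you could dump every sstable and grep for the key:
cd /var/lib/cassandra/data/cycling/cyclist_name-*/        # data directory path is an assumption
for f in *-Data.db; do sstabledump "$f" > "dump_$f.json"; done
grep -l c7fceba0-c141-4207-9494-a29f9809de6f dump_*.json  # which dumps mention the deleted row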

Related

How to recover deleted data in Cassandra?

If I deleted a value in a cell of a table, but later want to recover it, how can I do that?
I know that Cassandra doesn't really delete data right away, it just marks it as deleted, so how can I recover the data?
Example use case: I want to delete all information about a user, so I first delete the information in the Cassandra database, then I try to delete their information somewhere else, but that fails with an error, so I have to stop the deletion process and recover the deleted data from the Cassandra database.
How can I do that?
Unfortunately not. You could however use sstabledump (Cassandra >= 3.0) to inspect sstable contents, but there are some drawbacks:
if the data was not yet flushed to disk (i.e., it was still in the memtable), it will be deleted before ever reaching an sstable
you need to find the sstable that the data belongs to
There are probably other drawbacks that I'm missing right now.
Some workarounds
first copy the data to another table and then perform the delete. After you delete the information from the other location, you can safely delete it from your backup table.
a new column ("pending_delete") in which you record the state; you would then only query for your "live" data (a CQL sketch of this is shown after this list).
a new table where you would store the primary key of the data to be deleted, and delete it from both tables once the operation on the other location is successful.
Choosing the right solution I guess depends on your use case and the size of your data.
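A minimal CQL sketch of the "pending_delete" workaround, reusing the cycling.cyclist_name table from the first question (the column name and flow are just illustrative):
ALTER TABLE cycling.cyclist_name ADD pending_delete boolean;
-- mark the row instead of deleting it
UPDATE cycling.cyclist_name SET pending_delete = true WHERE id = c7fceba0-c141-4207-9494-a29f9809de6f;
-- if the external deletion fails, simply un-mark it
UPDATE cycling.cyclist_name SET pending_delete = false WHERE id = c7fceba0-c141-4207-9494-a29f9809de6f;
-- once everything has succeeded, perform the real delete
DELETE FROM cycling.cyclist_name WHERE id = c7fceba0-c141-4207-9494-a29f9809de6f;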

Cassandra - recovery of data after accidental delete

As the data in case of Cassandra is physically removed during compaction, is it possible to access the recently deleted data in any way? I'm looking for something similar to Oracle Flashback feature (AS OF TIMESTAMP).
Also, I can see the pieces of deleted data in the relevant commit log file, however it's obviously unreadable. Is it possible to convert this file to a more readable format?
You will want to execute a restore from your commitlog.
The safest approach is to copy the commitlog to a new cluster (with the same schema) and restore it following the instructions (comments) in the commitlog_archiving.properties file. In your case, you will want to set restore_point_in_time to a time between your insert and your delete.
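As an illustration, and assuming the settings documented in the comments of that file, the configuration might look roughly like this (the paths, commands, and timestamp are placeholders; check your own file for the exact format):
# on the original cluster: archive commitlog segments as they are closed
archive_command=cp %path /backup/commitlog/%name
# on the new cluster: restore them and replay up to a point in time
restore_command=cp -f %from %to
restore_directories=/backup/commitlog
restore_point_in_time=2019:05:21 12:00:00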

When is it NOT necessary to truncate the table when restoring a snapshot (incremental) for Cassandra?

When is it NOT necessary to truncate the table when restoring a snapshot (incremental) for Cassandra?
All the different documentation "providers", including the 2nd edition of Cassandra: The Definitive Guide, say something like this: "If necessary, truncate the table." If you restore without truncating (removing the tombstones), Cassandra continues to shadow the restored data. This behavior also occurs for other types of overwrites and causes the same problem.
If I have an insert only C* keyspace (no upserts and no deletes), do I ever need to truncate before restoring?
The documentation seems to imply that I can delete all of the sstable files from a column family (rm -f /data/.), copy the snapshot to /data/, and nodetool refresh.
Is this true?
You are right - you can restore a snapshot exactly this way. Copy over the sstables, restart the node and you are done. With incremental backups, make sure you have all the sstables containing your data.
What can happen if you have updates and deletes is that, after restoring a node (or while restoring multiple nodes), stale data becomes visible, or you run into problems with tombstones when data was deleted after the snapshot was taken.
The point of truncating tables is that all data is gone at once, so you avoid such problems.
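A hedged sketch of that restore sequence for a single table (the keyspace, table, directory names, and snapshot tag are assumptions):
# on the node, remove the live sstables for the table
rm -f /var/lib/cassandra/data/my_ks/my_table-<table-uuid>/*.db
# copy the snapshot (plus any incremental backups) back into place
cp /var/lib/cassandra/data/my_ks/my_table-<table-uuid>/snapshots/my_snapshot/* /var/lib/cassandra/data/my_ks/my_table-<table-uuid>/
# tell Cassandra to pick up the newly placed sstables
nodetool refresh my_ks my_table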

how to archive and purge Cassandra data

I have a Cassandra cluster with multiple data centres. I want to archive data monthly and then purge it. There are numerous articles about backing up and restoring, but none that mention how to archive data in a Cassandra cluster.
Can someone please let me know how I can archive my data in the Cassandra cluster monthly and then purge it.
I think there is no dedicated tool for archiving Cassandra data. You have to write either Spark jobs or MapReduce jobs that use CqlInputFormat to archive the data. You can follow the links below to understand how people are archiving data in Cassandra:
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660
[3] - http://accelconf.web.cern.ch/AccelConf/ICALEPCS2013/papers/tuppc004.pdf
There is also a way to turn on incremental backups in Cassandra, which can be used somewhat like CDC.
The best practice is to use the time-window compaction strategy (TWCS) with a monthly window on your tables, along with a TTL of one month, so that data older than a month is purged automatically.
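As a rough illustration, such a table might be defined like this (the keyspace, table, and columns are made-up examples):
CREATE TABLE my_ks.events (
    sensor_id uuid,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 30}
  AND default_time_to_live = 2592000;  -- 30 days, in seconds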
If you instead write a purge job that does this deletion (on tables that do not have the right compaction strategy applied), it can hurt cluster performance, because searching the data on a date/month basis will overwhelm the cluster.
I have experienced this: we ultimately had to go back, change the table structures, and alter the compaction strategy. That is why getting the table design right in the first place is very important. From the beginning we need to think not only about how the data will be inserted and read, but also about how it will be deleted, and then frame the keys, compaction, TTL, etc. accordingly.
For archiving, just write a few lines of code to read data from Cassandra and put it in your archival location.
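For example, a very simple archival step could be a cqlsh COPY export of the table to CSV (the keyspace, table, and path are placeholders; note that COPY exports the whole table rather than a single month):
cqlsh -e "COPY my_ks.my_table TO '/archive/my_table_2019-05.csv' WITH HEADER = true;"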
Let me know if this helps you get the end result you want, or if you have further questions I can help with.

Cassandra - SSTable Read Path & Write Path

Hello Cassandra specialists!!!
How would you go about analyzing the read path and write path for Cassandra?
Basically, I would like to know what the read path and write path measurements are, and, given some sample row keys, how I can find out how many SSTables currently exist for a particular row key and where they are located. Based on these details I would like to know what is causing slowness on the read path and what can be improved.
I am particularly interested in knowing how many SSTables there are for a particular row key and where they are located. This is one of the POCs I am working on for the client.
Thanks in advance...
Another possible mechanism could be to use
nodetool getsstables <keyspace> <cf> <key>
which will give you the list of sstables which contain the key.
Another option is to write a script that runs sstable2json on all the files in the folder and greps for the specific key. Once the key is found, you can load that file into some sort of data structure and later pull the details out of it.
The Docs tend to be a good starting point:
Write path to compaction
About the read path
how will I know how many SSTables there are currently for that particular row key
You can use sstable2json to view the raw information contained within sstables in JSON format and hunt down particular keys to see how they are distributed across replicas. This will be fiddly, especially if you have a lot of sstables, so you can use nodetool getendpoints <keyspace> <table> <key> to work out which node actually owns the key and then start converting sstables to JSON.
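Putting those pieces together, a sketch of the workflow, reusing the id from the first question (the keyspace, table, and sstable path are assumptions):
# which replicas own the key
nodetool getendpoints cycling cyclist_name c7fceba0-c141-4207-9494-a29f9809de6f
# on one of those replicas: which sstables contain the key
nodetool getsstables cycling cyclist_name c7fceba0-c141-4207-9494-a29f9809de6f
# inspect the raw rows in one of the reported sstables
sstabledump /var/lib/cassandra/data/cycling/cyclist_name-*/mc-3-big-Data.db | grep c7fceba0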
