Cassandra: get backup of expired data

I have a typically very large amount of data in one of my Cassandra tables. I want to keep only the last two months of data in the table, so I applied a TTL of two months to every row. But now I want to keep the expired data as a backup for a later use case. Please suggest what I should do to take this backup.
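For reference, the per-row TTL described above might look like the following CQL sketch; the keyspace, table and column names are made up, and 5184000 seconds is roughly two months (60 days):

-- Hypothetical CQL: every write carries a two-month TTL, so rows expire automatically.
INSERT INTO my_keyspace.events (id, payload, created_at)
VALUES (uuid(), 'some data', toTimestamp(now()))
USING TTL 5184000;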

Data in Cassandra is stored as files on disk. You can copy those files off your production machines onto whatever storage medium you like and restore them later. The link below shows how to do this:
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_restore_c.html

Related

cassandra: restoring partially lost data

Theoretical question:
Let's say I have a Cassandra cluster with some data in it.
Backups are created on a daily basis.
Now a subset of the data is lost, either through an application error or manual deletion.
What is the best way to restore that data from an existing backup?
I can think of starting a separate node with the backup disk attached, then exporting the data manually through SELECTs and re-importing it into the production database.
That would work, but it sounds complicated. Is there a more straightforward solution for such problems?
If it's a single partition, your best bet is probably to use sstabledump (or something like sstable-tools) to read the data from the backup and manually re-insert it. If you are OK with restoring everything deleted since the time of the snapshot: reduce gc_grace_seconds and run a forced compaction to purge any tombstones (otherwise they will continue to shadow the restored data), then either use sstableloader or, if the token ranges are the same, copy the backed-up SSTables back into the data directory.
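A hedged CQL sketch of the gc_grace_seconds step, with made-up keyspace and table names (restore the original value once the tombstones have been compacted away):

-- Temporarily lower gc_grace_seconds so a forced compaction can drop the tombstones.
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 3600;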

Azure Data Lake incremental load with file partition

I'm designing Data Factory pipelines to load data from Azure SQL DB into Azure Data Lake.
My initial load/POC was a small subset of data, and I was able to load it from the SQL tables into Azure DL.
Now there is a huge volume of tables (some with a billion-plus rows) that I want to load from SQL DB into Azure DL using DF.
The MS docs mention two options, i.e. watermark columns and change tracking.
Let's say I have a "cust_transaction" table that has millions of rows; if I load it into the DL, it lands as "cust_transaction.txt".
Questions:
1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?
2) How do I split or partition the files into smaller files?
3) How should I merge and load the deltas from the source data into the files?
Thanks.
You will want multiple files. Typically, my data lakes have multiple zones. The first zone is Raw. It contains a copy of the source data organized into entity/year/month/day folders, where entity is a table in your SQL DB. Typically, those files are incremental loads. Each incremental load for an entity has a file name similar to Entity_YYYYMMDDHHMMSS.txt (and maybe even more info than that) rather than just Entity.txt. Wherever possible, the timestamp in the file name is the end of the incremental slice (the max possible insert or update time in the data) rather than just the current time (sometimes they are effectively the same and it doesn't matter, but I tend to get a consistent incremental-slice end time for all tables in my batch). You can achieve the date folders and the timestamp in the file name by parameterizing the folder and file in the dataset.
Melissa Coates has two good articles on Azure Data Lake: Zones in a Data Lake and Data Lake Use Cases and Planning. Her naming conventions are a bit different from mine, but both of us would tell you to just be consistent. I would land the incremental load file in Raw first. It should reflect the incremental data as it was loaded from the source. If you need a merged version, that can be done with Data Factory or U-SQL (or your tool of choice) and landed in the Standardized Raw zone. There are some performance issues with small files in a data lake, so consolidation could be good, but it all depends on what you plan to do with the data after you land it there. Most users would not access data in the Raw zone, instead using data from the Standardized Raw or Curated zones. I also want Raw to be an immutable archive from which I can regenerate data in the other zones, so I tend to leave the files as they landed. But if you found you needed to consolidate there, that would be fine.
Change tracking is a reliable way to get changes, but I don't like the naming conventions/file organization in their example. I would make sure your file name has the entity name and a timestamp in it. They use Incremental - [PipelineRunID]. I would prefer [Entity]_[YYYYMMDDHHMMSS]_[TriggerID].txt (or leave the run ID off) because it is more informative to others. I also tend to use the Trigger ID rather than the pipeline Run ID. The Trigger ID spans all the pipelines executed in that trigger instance (batch), whereas the pipeline Run ID is specific to a single pipeline.
If you can't do change tracking, the watermark is fine. I usually can't add change tracking to my sources and have to go with a watermark. The issue is that you are trusting that the application's modified date is accurate. Are there ever times when a row is updated and the modified date is not changed? When a row is inserted, is the modified date also populated, or would you have to check two columns to get all new and changed rows? These are the things we have to consider when we can't use change tracking.
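A hedged T-SQL sketch of a watermark extract for the cust_transaction table mentioned in the question; the modified_date column and the two watermark parameters are assumptions, and in Data Factory they would typically be supplied by the pipeline:

-- Pull only the rows that changed since the last successful load.
SELECT *
FROM dbo.cust_transaction
WHERE modified_date > @old_watermark
  AND modified_date <= @new_watermark;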
To summarize:
Load incrementally and name your incremental files intelligently.
If you need a current version of the table in the data lake, that is a separate file in your Standardized Raw or Curated zone.

Synchronize a data lake with deleted records

I am building a data lake to integrate multiple data sources for advanced analytics.
In the beginning, I selected HDFS as the data lake storage. But I have a requirement to handle updates and deletes in the data sources, which I have to synchronize with the data lake.
To respect the immutable nature of the data lake, I plan to use the LastModifiedDate from the data source to detect that a record has been updated, and insert that record into the data lake with the current date. The idea is then to select the record with max(date).
However, I am not able to understand how I will detect deleted records in the sources, and what I should then do in the data lake.
Should I use another data store like Cassandra and execute a delete command? I am afraid that would lose the immutability property.
Can you please suggest good practice for this situation?
1. Question - Detecting deleted records from data sources
Detecting deleted records from data sources requires that your data sources support it. It is best if deletion is only done logically, e.g. with a change flag. Some databases can also track deleted rows (see, for example, change tracking in SQL Server), and some ETL solutions like Informatica offer CDC (Change Data Capture) capabilities.
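A hedged T-SQL sketch of enabling SQL Server change tracking, which surfaces deletes as well as inserts and updates; the database and table names are made up:

-- Turn on change tracking at the database level, then per table.
ALTER DATABASE SalesDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Customer
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

-- Deleted rows then show up with SYS_CHANGE_OPERATION = 'D' when querying
-- CHANGETABLE(CHANGES dbo.Customer, @last_sync_version).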
2. Question - Changed data handling in a big data solution
There are different approaches. Of course, you can use a key-value store, at the cost of adding some complexity to the overall solution. First you have to clarify whether it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally building an actual image (the data as it is in your data source). Also consider solutions like Databricks Delta, which address these topics without the need for an additional store. For example, you are able to do an upsert on Parquet files with Delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)
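If the change feed also flags deletions, a hedged variant of the same Delta MERGE can remove those rows as well; the is_deleted flag is an assumption about what the feed carries:

-- Hypothetical delete-aware upsert: flagged rows are deleted, the rest are upserted.
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED AND updates.is_deleted = true THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED AND updates.is_deleted = false THEN
  INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)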
If your solution also requires low-latency access via a key (e.g. to support an API), then a key-value store like HBase, Cassandra, etc. would be helpful.
Not being able to simply update or delete records is a usual constraint when building a data lake on Hadoop. One approach you can try is:
When you are adding lastModifiedDate, also add one more column named status. If a record is deleted, mark its status as Deleted. The next time you want to query the latest active records, you will be able to filter the deleted ones out, as in the sketch below.
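A hedged Spark SQL sketch of that "latest active record" query, assuming the daily loads land in a table named customer_raw with customer_id, lastModifiedDate and status columns, all of which are made-up names:

-- Keep only the newest version of each record, then drop soft-deleted rows.
SELECT *
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY customer_id
                            ORDER BY lastModifiedDate DESC) AS rn
  FROM customer_raw
) latest
WHERE rn = 1
  AND status <> 'Deleted';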
You can also use Cassandra or HBase (or any NoSQL database) if you are performing ACID operations on a daily basis. If not, the first approach is your best choice for building a data lake on Hadoop.

How to store data in Cassandra in a temporary location while a sync is in progress?

We have a MySQL server running which serves application writes. To do some batch processing, we have written a sync job to migrate data into a Cassandra cluster:
1. A daily sync job which transfers rows by their updated timestamp for that day.
2. A complete sync job which transfers the complete data set, overriding existing rows.
Now there is a possibility that a row was deleted from MySQL; in that case, using the above approach, it will lie in Cassandra forever.
To solve that problem we have given a TTL of 15 days to every row, so eventually it will get deleted; if it was not actually deleted, the next full sync overwrites the TTL again.
It is working fine as far as the use case is concerned, but the issue is that during a full sync the complete data set is overwritten: SSTables are generated continuously, compactions happen all the time, load averages shoot up and cause slowness, and the backup size increases (all of which could have been avoided).
Essentially we want to replace the existing table data with the new data, but we don't want to truncate before starting the job, only after the job completes.
Is there any way this can be solved other than creating a new table altogether and dropping the old table once the data is generated?
You can look at the double-run migration strategy I presented here: http://www.slideshare.net/doanduyhai/from-rdbms-to-cassandra-without-a-hitch
It has the advantage of allowing 100% uptime and a possible rollback if things go wrong. The downside is the amount of work required in terms of releases and code.

How to archive and purge Cassandra data

I have a Cassandra cluster with multiple data centres. I want to archive data monthly and then purge it. There are numerous articles on backing up and restoring, but none that mention how to archive data in a Cassandra cluster.
Can someone please let me know how I can archive my data in a Cassandra cluster monthly and then purge it?
I don't think there is a ready-made tool for archiving Cassandra data. You have to write either Spark jobs or MapReduce jobs that use CqlInputFormat to archive the data (see the sketch at the end of this answer). You can follow the links below to see how people archive data in Cassandra:
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660
[3] - http://accelconf.web.cern.ch/AccelConf/ICALEPCS2013/papers/tuppc004.pdf
There is also a way to turn on incremental backups in Cassandra, which can be used like CDC.
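One possible shape for such an archive job, as a hedged Spark SQL sketch: it assumes the DataStax spark-cassandra-connector (3.x) is on the classpath and registered as a catalog named cass, and all keyspace, table, column and path names are made up:

-- Copy last month's rows out of Cassandra into Parquet files on the archive store.
-- Catalog setup (spark.sql.catalog.cass = com.datastax.spark.connector.datasource.CassandraCatalog)
-- is assumed to have been done in the Spark session configuration.
CREATE TABLE events_archive_2019_01
USING parquet
LOCATION 'hdfs:///archive/events/2019-01'
AS
SELECT *
FROM cass.my_keyspace.events
WHERE event_time >= '2019-01-01' AND event_time < '2019-02-01';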
The best practice is to use the time-window compaction strategy (TWCS) with a window of about a month on your tables, along with a one-month TTL, so that data older than a month is purged automatically.
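A hedged CQL sketch of that setup, with made-up keyspace and table names; 2592000 seconds is 30 days:

-- Monthly time windows plus a 30-day default TTL so expired data drops out with whole SSTables.
ALTER TABLE my_keyspace.events
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'DAYS',
                   'compaction_window_size': 30}
AND default_time_to_live = 2592000;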
If you instead write a purge job that does this deletion work (on tables that do not have the right compaction strategy applied), it can hurt cluster performance, because searching for data on a date/month basis will overwhelm the cluster.
I have experienced this, where we ultimately had to go back, change the structure of the tables and alter the compaction strategy. That is why getting the table design right in the first place is very important. From the beginning, we need to think not only about how the data will be inserted into and read from the tables, but also how it will be deleted, and then choose the keys, compaction, TTL, etc. accordingly.
For archiving, just write a few lines of code to read data from Cassandra and put it in your archival location.
Let me know if this helps you get the end result you want, or if you have further questions I can help with.
