Greenplum incremental backup through gpbackup takes cumulative backup

I am using Greenplum's gpbackup utility to take incremental backups of a database, with the following commands:
1: Full backup
gpbackup --dbname incdb --backup-dir /data/gpbackups --leaf-partition-data
2: Then I added some rows and took incremental backup as:
gpbackup --dbname incdb --backup-dir /data/gpbackups --leaf-partition-data --incremental
But when I go to the backup folder, unzip the backed-up files and read them in a terminal, I see that the incremental backup files contain all the data from the start, instead of only the changed data. Shouldn't they contain only the data added after the full backup?
Please also let me know whether remote backups are possible with the gpbackup utility.

You can find more details on how Greenplum implements incremental backup here:
https://gpdb.docs.pivotal.io/backup-restore/1-16/admin_guide/managing/backup-gpbackup-incremental.html
In short, incremental backups work best on append-optimized (AO) partitioned tables.
gpbackup identifies which partitions of AO tables have changed since the last full or incremental backup, and adds those to the backup set.
Heap tables are always fully backed up, regardless of whether the --incremental flag is used.
In your use case, were you using AO partitioned tables?
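For reference, a table that could benefit from incremental backup might be defined as in the sketch below. The table name, columns, and date range are hypothetical, and the syntax assumes the classic Greenplum partitioning clauses, so check it against your Greenplum version before using it.

```shell
# Print DDL for a hypothetical append-optimized (AO) table partitioned by month.
# Table and column names are illustrative only.
ao_partitioned_ddl() {
    cat <<'SQL'
CREATE TABLE sales_ao (
    id        int,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_date)
(
    START (date '2020-01-01') INCLUSIVE
    END   (date '2021-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
SQL
}
```

The DDL could then be applied with something like `ao_partitioned_ddl | psql -d incdb`. With such a table, rows inserted into one month's partition should cause only that leaf partition to be picked up by the next incremental run.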

Related

Is it possible to backup and restore Cassandra cluster using dsbulk?

I searched the internet a lot and found many ways to back up and restore a Cassandra cluster, such as nodetool snapshot and Medusa. My question is: can I use dsbulk to back up a Cassandra cluster? What are its limitations? Why doesn't anyone suggest it?
It's possible to use it in some cases, but it's not practical, primarily for these reasons (the list could be longer):
DSBulk puts additional load on the cluster nodes because it goes through the standard read path. In contrast, nodetool snapshot just creates hard links to the data files, which puts no additional load on the nodes.
It's harder to implement incremental backups with DSBulk: you need to come up with a SELECT condition that finds only the data changed since the last backup, so you need a timestamp column, because you can't use the value of the writetime function in a WHERE condition. It also requires rescanning all the data anyway, and it's impossible to find out which data was deleted. With nodetool snapshot, you just compare which files have changed since the last backup and back up only those.
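The file comparison mentioned for snapshots can be sketched roughly as follows. This is only an illustration: the marker file is a hypothetical convention (an empty file touched at the end of each backup run), and the function simply relies on file modification times.

```shell
# Print regular files under DIR that were modified more recently than MARKER.
# MARKER is a hypothetical empty file touched at the end of each backup run.
list_changed_files() {
    dir=$1
    marker=$2
    find "$dir" -type f -newer "$marker" | sort
}
```

A backup job could pass the resulting list to a copy tool and then `touch` the marker. Because SSTable files are immutable once written, a newer file indicates new data rather than an in-place change.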

how to create solr index backup and restore?

We are creating snapshots of all Cassandra keyspaces, but we also need to create a backup of the Solr index, which contains a huge amount of data and is used for Solr indexing.
Here is the DataStax link on creating a backup.
I tried the following command:
$nodetool -h localhost rebuild_index ks cf ks.cf
which works fine for small data but takes a long time for large data.
I then followed the "Backup Solr Indexes" section in the DataStax docs
and tried to run:
$backup -d /var/lib/cassandra/data/solr.data -u root -v
and found this:
backup: Unrecognized or ambiguous switch '-d'; type 'backup help interactive' for detailed help.
This means that this backup package is not meant for Solr indexes. Where can we find a suitable backup package?
Could someone suggest how to back up and restore a Solr index?
Assuming you'll be creating backups intended to restore a cluster with the same token layout, and you can make your backups in a rolling fashion, something like the following may at least be a starting point:
For each node...
1.) nodetool drain the node to make sure your Solr cores are in sync with their backing Cassandra tables. (drain forces a memtable flush, which forces a Solr hard commit.)
2.) Shut down the node.
3.) Manually back up your data directories (.../solr.data for your index).
4.) Start the node again.
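The per-node steps above could be scripted along these lines. This is a sketch under assumptions: `STOP_CMD` and `START_CMD` are placeholders for whatever stops and starts a node in your environment, and the archive layout is arbitrary.

```shell
# Back up one node's Solr index directory following the steps above.
# STOP_CMD and START_CMD are placeholders: set them to the commands that
# stop and start the node in your environment.
backup_solr_node() {
    solr_dir=$1      # the node's .../solr.data directory
    archive=$2       # destination .tar.gz file
    : "${STOP_CMD:?set STOP_CMD to the command that stops the node}"
    : "${START_CMD:?set START_CMD to the command that starts the node}"
    nodetool drain || return 1   # flush memtables, forcing a Solr hard commit
    $STOP_CMD || return 1        # shut the node down
    tar -czf "$archive" -C "$(dirname "$solr_dir")" "$(basename "$solr_dir")"
    status=$?
    $START_CMD                   # bring the node back even if tar failed
    return $status
}
```

Run it on one node at a time so the rest of the cluster stays available while each node is down.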

Cassandra restore from incremental backup files from multiple nodes

I am also looking for an incremental/point-in-time backup and restore solution.
I have three Cassandra nodes and enabled incremental backup. I tried copying one day's SSTable files from the backups folder on one node to the /data folder of a new Cassandra cluster, and that works. But I have three nodes, and the file names on all three nodes are the same, so I don't know how to restore the incremental backup files from all three nodes.
Your comments are really appreciated!
One simple solution is to configure a new cluster with exactly the same nodes, and everything works.
But if I want to replay the data onto fewer nodes, such as only one, do I have to take a complete snapshot?

How to modify the memtable flush time interval in cassandra?

I have enabled incremental backup in the cassandra.yaml file. As far as I know, when incremental backups are enabled, Cassandra backs up the data (in the backups directory) only when the memtable is flushed. But what if the memtable has yet to be flushed? I won't be able to get the incremental backup, right? I know that certain conditions must be met for the memtable to be flushed, such as a time interval or memtable space. My question is how to modify this so that even if I enter one record after the last snapshot, I can still back up the entire data including that latest entry.
Consider this example
Take the snapshot.
Clear incremental backup (backups directory)
Enter a record to a table.
Check for the incremental backup in backups directory. It is still empty.
Now how do I back up the record that was written after the last snapshot? In general, how do we back up the entire up-to-date data without taking a snapshot?
You can flush the files manually with nodetool flush just before taking the backup. That way you'll always have the latest memtable flushed.
nodetool docs
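A minimal sketch of that flush-then-copy approach is shown below. The `DATA_DIR` variable and its default are assumptions (point it at your own data_file_directories entry), and the destination layout is arbitrary.

```shell
# Flush memtables for KEYSPACE, then copy its incremental backup files
# (the per-table "backups" directories) into DEST, keeping the table layout.
# DATA_DIR is an assumption: point it at your data_file_directories entry.
collect_incrementals() {
    keyspace=$1
    dest=$2
    data_dir=${DATA_DIR:-/var/lib/cassandra/data}
    nodetool flush "$keyspace" || return 1
    for dir in "$data_dir/$keyspace"/*/backups; do
        [ -d "$dir" ] || continue      # skip tables with no backups directory
        table=$(basename "$(dirname "$dir")")
        set -- "$dir"/*                # files currently in this backups directory
        [ -e "$1" ] || continue        # nothing was flushed for this table
        mkdir -p "$dest/$keyspace/$table" || return 1
        cp -p "$@" "$dest/$keyspace/$table"/ || return 1
    done
}
```

For example, `collect_incrementals my_keyspace /mnt/backups/node1` would flush `my_keyspace` and copy whatever SSTables the flush produced. The keyspace name and destination path here are placeholders.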
If you want to back up a cluster without taking a snapshot, you can do it by simply saving everything under the /data folder from every node (this mainly includes the .db files, stats files, etc.).
To avoid overwriting files, you should store the token information along with them.
When you want to restore from this backup, you should spin up a cluster with the same number of nodes and simply copy the data one-to-one from each backed-up node to a restored node. Note that you'll have to modify cassandra.yaml on each restored node to include the relevant token (as well as the peers/seeds/etc.).
After all the data is copied, you can start C* process on all the nodes.

Proper cassandra keyspace restore procedure

I am looking for confirmation that my Cassandra backup and restore procedures are sound and I am not missing anything. Can you please confirm, or tell me if something is incorrect/missing?
Backups:
I run daily full backups of the keyspaces I care about, via "nodetool snapshot keyspace_name -t current_timestamp". After the snapshot has been taken, I copy the data to a mounted disk, dedicated to backups, then do a "nodetool clearsnapshot $keyspace_name -t $current_timestamp"
I also run hourly incremental backups - executing a "nodetool flush keyspace_name" and then moving files from the backup directory of each keyspace, into the backup mountpoint
Restore:
So far, the only valid way I have found to do a restore (and tested/confirmed) is to do this, on ALL Cassandra nodes in the cluster:
Stop Cassandra
Clear the commitlog *.log files
Clear the *.db files from the table I want to restore
Copy the snapshot/full backup files into that directory
Copy any incremental files I need to (I have not tested with multiple incrementals, but I am assuming I will have to overlay the files, in sequence from oldest to newest)
Start Cassandra
On one of the nodes, run a "nodetool repair keyspace_name"
So my questions are:
Does the above backup and restore strategy seem valid? Are any steps inaccurate or anything missing?
Is there a way to do this without stopping Cassandra on EVERY node? For example, is there a way to restore the data on ONE node, then somehow make it "authoritative"? I tried this, and, as expected, since the restored data is older, the data on the other nodes (which is newer) overwrites it when they sync up during repair.
Thank you!
There are two ways to restore Cassandra backups without restarting C*:
Copy the files into place, then run "nodetool refresh". This has the caveat that the rows will still be older than tombstones. So if you're trying to restore deleted data, it won't do what you want. It also only applies to the local server (you'll want to repair after).
Use "sstableloader". This will load data to all nodes. You'll need to make sure you have the sstables from a complete replica, which may mean loading the sstables from multiple nodes. Added bonus, this works even if the cluster size has changed. I'm not sure if ordering matters here (that is, I don't know if row timestamps are preserved through the load or if they're redefined during load)
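The two options might be wrapped as below. This is a sketch rather than a tested procedure: the function names are made up for illustration, and the keyspace, table, host list, and directory are supplied by the caller.

```shell
# Option 1: SSTable files were already copied into the table's data directory
# on this node. Load them without a restart, then repair the keyspace.
restore_with_refresh() {
    keyspace=$1
    table=$2
    nodetool refresh "$keyspace" "$table" && nodetool repair "$keyspace"
}

# Option 2: stream SSTables into the cluster through one or more contact hosts.
# SSTABLE_DIR must be a path whose last two components are keyspace/table.
restore_with_sstableloader() {
    hosts=$1          # comma-separated list of contact hosts
    sstable_dir=$2
    sstableloader -d "$hosts" "$sstable_dir"
}
```

With the second option, the call would be repeated for each directory of SSTables needed to cover a complete replica.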
