How to create a Solr index backup and restore? - Cassandra

We are creating snapshots of all Cassandra keyspaces, but we also need to create a backup of the Solr index, which contains a huge amount of data and is needed for Solr indexing.
Here is the DataStax link describing how to create a backup.
I tried the following command:
$nodetool -h localhost rebuild_index ks cf ks.cf
which works fine for small data sets but takes a very long time for large amounts of data.
"Backup Solr Indexes" section in datastax doc.
and try to run:
$backup -d /var/lib/cassandra/data/solr.data -u root -v
and found this:
backup: Unrecognized or ambiguous switch '-d'; type 'backup help interactive' for detailed help.
which suggests this backup package is not meant for the Solr index. Where can we find a suitable backup package?
Could someone suggest how to create a backup and restore of the Solr index?

Assuming you'll be creating backups intended to restore a cluster with the same token layout, and you can make your backups in a rolling fashion, something like the following may at least be a starting point:
For each node...
1.) nodetool drain the node to make sure your Solr cores are in sync with their backing Cassandra tables. (drain forces a memtable flush, which forces a Solr hard commit.)
2.) Shut down the node.
3.) Manually back up your data directories (.../solr.data for your index).
4.) Start the node again.
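A minimal shell sketch of those four steps on a single node (the backup path and the service name are assumptions, not from the original procedure; adjust for your install):
# 1.) sync the Solr cores with their backing tables (drain forces a memtable flush / hard commit)
$ nodetool -h localhost drain
# 2.) stop the node (service name assumed; may be 'dse' or 'cassandra' depending on the install)
$ sudo service dse stop
# 3.) copy the data directories, including .../solr.data, to backup storage
$ tar czf /backups/$(hostname)-$(date +%F).tar.gz /var/lib/cassandra/data
# 4.) start the node again
$ sudo service dse start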

Related

How do I restore a schema in Cassandra?

This is an example scenario, and we wanted to understand whether it would be possible to recover from it, and also to understand the schema better.
The hypothetical setup is a single node running Cassandra 3.11, with 1 keyspace and 1 table.
root@dd85fa9a3c41:/# cqlsh -k cycling -e "describe tables;"
rank_by_year_and_name
Now I reset my schema and restart Cassandra (there are no other nodes to replicate it back from):
root@dd85fa9a3c41:/# nodetool resetlocalschema
With the new schema, I no longer "see" my keyspace+table:
root@dd85fa9a3c41:/# cqlsh -e "describe keyspaces;"
system_traces system_schema system_auth system system_distributed
I have lost my original schema, which contained my keyspace and table. But the files are still on disk:
root@dd85fa9a3c41:/# ls -l /var/lib/cassandra/data/cycling/
total 0
drwxr-xr-x 1 root root 14 Nov 22 11:32 rank_by_year_and_name-4eedbbf0
How could I restore that keyspace in this scenario? With sstableloader I could recreate keyspace+table and import.
I would like to recover this schema and see my keyspace+table again.
I haven't found any way to do this without manually recreating and importing with sstableloader.
Thank you if you help me!
On-disk data and schema are two different things in Cassandra.
To be able to restore a keyspace schema, you first need to back it up using nodetool snapshot. It will take a backup of the SSTables (as hard links) and create a schema.cql file containing the schema.
See the official doc here: https://cassandra.apache.org/doc/3.11/cassandra/operating/backups.html
I realise it's a hypothetical scenario but running resetlocalschema on a single-node cluster is a bad idea. The node is supposed to drop its copy of the schema and request the latest copy from other nodes but in the case of a single-node cluster, there are no nodes to get the schema from.
You really shouldn't run resetlocalschema on a single-node cluster unless you're doing some specific test or edge case activity as discussed in CASSANDRA-5094.
Now to your question on how you would restore the schema, most enterprises have a copy of their schema usually in a Change Management system (or CI/Config Management System). Before updates can be made to the schema in production, it usually goes through testing, peer-review, staging/pre-production validation, and finally deployed to production through an approved Change Request (terms might differ between organisations but the net intent is the same).
Similarly when you perform regular backups, the nodetool snapshot command stores a copy of the schema together with the SSTable backups. In this example I posted in https://dba.stackexchange.com/questions/316520/, you can see that the snapshots/ folder contains both a manifest.json (inventory of SSTables included in the snapshot) and a schema.cql (the schema at the time of the snapshot):
data/
  community/
    users-6140f420a4a411ea9212efde68e7dd4b/
      snapshots/
        1591083719993/
          manifest.json
          mc-1-big-CompressionInfo.db
          mc-1-big-Data.db
          mc-1-big-Digest.crc32
          mc-1-big-Filter.db
          mc-1-big-Index.db
          mc-1-big-Statistics.db
          mc-1-big-Summary.db
          mc-1-big-TOC.txt
          schema.cql
From the above you should be able to see that you have two options available:
recreate the schema from a copy that's been submitted/peer-reviewed in your Change Management System, or
recreate the schema from the snapshot.
The choice depends on what you're trying to achieve. Cheers!
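For the second option, here is a rough sketch of recreating the schema from a snapshot and re-attaching the data (the keyspace, table directory and snapshot tag are placeholders, not values from this scenario):
# 1. replay the schema captured at snapshot time
$ cqlsh -f /var/lib/cassandra/data/<keyspace>/<table>-<id>/snapshots/<tag>/schema.cql
# 2a. either stream the snapshotted SSTables back in ...
$ sstableloader -d 127.0.0.1 /path/to/snapshot_copy/<keyspace>/<table>/
# 2b. ... or copy them into the newly created table directory and load them in place
$ nodetool refresh <keyspace> <table>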

Is it possible to back up and restore a Cassandra cluster using dsbulk?

I searched the internet a lot and saw many ways to back up and restore a Cassandra cluster, such as nodetool snapshot and Medusa. But my question is: can I use dsbulk to back up a Cassandra cluster? What are its limitations? Why doesn't anyone suggest it?
It's possible to use it in some cases, but it's not practical, primarily because (the list could be longer):
DSBulk puts additional load onto the cluster nodes because it goes through the standard read path. In contrast, nodetool snapshot just creates hard links to the data files, with no additional load on the nodes.
It's harder to implement incremental backups with DSBulk - you need to come up with a SELECT condition that finds only the data changed since the last backup, which means you need a timestamp column, because you can't put a WHERE condition on the value of the writetime function. It will also require rescanning the whole data set anyway, and it's impossible to find out which data was deleted. With nodetool snapshot, you just compare which files have changed since the last backup and back up only those.
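For reference, a DSBulk-based "backup" would look roughly like this (keyspace/table names are just examples); every row passes through the normal read and write paths, which is exactly the overhead described above:
# full export of one table to CSV (scans the whole table through the read path)
$ dsbulk unload -h 127.0.0.1 -k my_ks -t my_table -url /backups/my_ks/my_table
# later: re-import the same files (goes through the normal write path)
$ dsbulk load -h 127.0.0.1 -k my_ks -t my_table -url /backups/my_ks/my_table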

What is the difference between backing up data using nodetool and the cqlsh COPY command?

Currently we have two options to back up the data of the tables in a Cassandra keyspace. We can either use nodetool commands or use the COPY command from the cqlsh terminal.
1) What are the differences between these commands ?
2) Which one is most appropriate ?
3) Also, if we are using nodetool to take a backup, we would generally flush the data from memtables to SSTables before we issue the nodetool snapshot command. So my question is: should we employ the same technique of flushing the data if we use the cqlsh COPY command?
Any help is appreciated.
Thanks very much.
GREAT question!
1) What are the differences between these commands ?
Running a nodetool snapshot creates a hard-link to the SSTable files on the requested keyspace. It's the same as running this from the (Linux) command line:
ln {source} {link}
A cqlsh COPY is essentially the same as doing a SELECT * FROM on a table. It'll create a text file with the table's data in whichever format you have specified.
In terms of their difference from a backup context, a file created using cqlsh COPY will contain data from all nodes, whereas nodetool snapshot needs to be run on each node in the cluster. In clusters where the number of nodes is greater than the replication factor, each snapshot will only be valid for the node on which it was taken.
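To make the difference concrete, the two commands look roughly like this (keyspace/table and tag names are examples):
# per node, file level: hard-links the SSTables of every table in the keyspace
$ nodetool snapshot -t daily_backup ks
# cluster wide, row level: exports the table as CSV through a full read
$ cqlsh -e "COPY ks.cf TO '/backups/ks_cf.csv' WITH HEADER = TRUE;"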
2) Which one is most appropriate ?
It depends on what you're trying to do. If you simply need backups for a node/cluster, then nodetool snapshot is the way to go. If you're trying to export/import data into a new table or cluster, then COPY is the better approach.
Also worth noting, cqlsh COPY takes a while to run (depending on the amount of data in a table), and can be subject to timeouts if not properly configured. nodetool snapshot is nigh instantaneous; although the process of compressing and SCPing snapshot files to an off-cluster instance will take some time.
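As an illustration of that last point, shipping a snapshot off-cluster is typically something along these lines (paths, tag and destination host are assumptions):
# archive the snapshot directories for one keyspace, then copy the archive off the node
$ tar czf /tmp/ks_daily_backup.tar.gz /var/lib/cassandra/data/ks/*/snapshots/daily_backup
$ scp /tmp/ks_daily_backup.tar.gz backup-host:/backups/$(hostname)/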
3) Should we employ the same technique of flushing the data if we use the cqlsh copy command ?
No, that's not necessary. As cqlsh COPY works just like a SELECT, it will follow the normal Cassandra read path, which will check structures both in RAM and on-disk.
nodetool snapshot is a good approach for any amount of data, and it creates hard links within seconds. The COPY command will take much more time, depending on the size of the data and the cluster. For small amounts of data and for testing you may use the COPY command, but for production nodetool snapshot is recommended.

Disk Space not freed up even after deleting keyspace from cassandra db and compaction

I created a keyspace and a table(columnfamily) within it.
Let's say "ks.cf"
After inserting a few hundred thousand rows into the column family cf, I checked the disk usage using df -h.
Then, I dropped the keyspace using the command DROP KEYSPACE ks from cqlsh.
Even after dropping it, the disk usage remains the same. I also ran nodetool compact, but no luck.
Can anyone help me out in configuring these things so that disk usage gets freed up after deleting the data/rows ?
I ran into this problem recently. After dropping a table, a snapshot is made. This snapshot will allow you to roll the drop back if it was not intended. If you do want that hard drive space back, however, you need to run:
nodetool -h localhost -p 7199 clearsnapshot
on the appropriate nodes. Additionally, you can turn snapshots off with auto_snapshot: false in your cassandra.yaml.
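Before clearing anything, you can check which snapshots exist and how much space they use, for example:
# lists all snapshots, including the auto-snapshot taken when the keyspace was dropped
$ nodetool -h localhost -p 7199 listsnapshots
Note that, as pointed out in a later answer, on newer versions clearsnapshot needs either -t <tag> or the --all switch to actually remove anything.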
If you are just trying to delete rows, then you need to let the deletion step go through the usual delete cycle (delete_row -> tombstone_creation -> compaction_actually_deletes_the_row).
Now, if you want to get rid of your keyspace completely, check your Cassandra data folder (it should be specified in your yaml file). In my case it is "/mnt/cassandra/data/". In this folder there is a subfolder for each keyspace (i.e. ks). You can just delete the folder related to your keyspace completely.
If you want to keep the folder around, it is good to know that Cassandra creates a snapshot of your keyspace before dropping it; basically a backup of all of your data. You can just go into the 'ks' folder, find the snapshots subdirectory, and delete the snapshot related to your keyspace drop.
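A sketch of that manual cleanup, assuming the data directory layout from this answer (the table directory and snapshot names are placeholders):
# find the auto-snapshot that was taken when the keyspace was dropped
$ ls /mnt/cassandra/data/ks/<table_dir>/snapshots/
# delete it to reclaim the disk space
$ rm -rf /mnt/cassandra/data/ks/<table_dir>/snapshots/<snapshot_name>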
Cassandra does not clear snapshots automatically when you drop a table or keyspace. If you have enabled auto_snapshot in cassandra.yaml, then every time you drop a table or keyspace, Cassandra will capture a snapshot of that table. This snapshot will help you roll back the table data if the drop was done by mistake. If you do want to clear that table's data from disk, you need to run the clearsnapshot command below to free the space.
nodetool -u XXXX -pw XXXXX clearsnapshot -t snapshotname
You can disable this auto_snapshot feature any time in cassandra.yaml.
nodetool cleanup removes all data from disk that is not needed there any more, i.e. data that the node is not responsible for. (clearsnapshot will clear all snapshots, which may not be what you want.)
The nodetool command can be used to clean up all unused (i.e. previously dropped) table snapshots in one go (here issued inside a running bitnami/cassandra:4.0 docker container):
$ nodetool --username <redacted> --password <redacted> clearsnapshot --all
Requested clearing snapshot(s) for [all keyspaces] with [all snapshots]
Evidence: space used by old table snapshots in the dicts keyspace:
a) before the cleanup:
$ sudo du -sch /home/<host_user>/cassandra_data/cassandra/data/data/<keyspace_name>/
134G /home/<redacted>/cassandra_data/cassandra/data/data/dicts/
134G total
b) after the cleanup:
$ sudo du -sch /home/<host_user>/cassandra_data/cassandra/data/data/<keyspace_name>/
4.0K /home/<redacted>/cassandra_data/cassandra/data/data/dicts/
4.0K total
Note: the accepted answer missed the --all switch (and the need to log in), but it still deserves to be upvoted.

Proper cassandra keyspace restore procedure

I am looking for confirmation that my Cassandra backup and restore procedures are sound and I am not missing anything. Can you please confirm, or tell me if something is incorrect/missing?
Backups:
I run daily full backups of the keyspaces I care about, via "nodetool snapshot keyspace_name -t current_timestamp". After the snapshot has been taken, I copy the data to a mounted disk, dedicated to backups, then do a "nodetool clearsnapshot $keyspace_name -t $current_timestamp"
I also run hourly incremental backups - executing a "nodetool flush keyspace_name" and then moving files from the backup directory of each keyspace into the backup mountpoint.
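Condensed into commands, the cycle described above is roughly the following per node (the mountpoint layout and tag naming are my own assumptions):
# daily full backup
$ ts=$(date +%Y%m%d%H%M%S)
$ nodetool snapshot -t $ts keyspace_name
$ tar czf /mnt/backups/full/$ts.tar.gz /var/lib/cassandra/data/keyspace_name/*/snapshots/$ts
$ nodetool clearsnapshot -t $ts keyspace_name
# hourly incremental backup (requires incremental_backups: true in cassandra.yaml)
$ nodetool flush keyspace_name
$ tar czf /mnt/backups/incremental/$(date +%Y%m%d%H%M%S).tar.gz --remove-files /var/lib/cassandra/data/keyspace_name/*/backups/*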
Restore:
So far, the only valid way I have found to do a restore (and tested/confirmed) is to do this, on ALL Cassandra nodes in the cluster:
Stop Cassandra
Clear the commitlog *.log files
Clear the *.db files from the table I want to restore
Copy the snapshot/full backup files into that directory
Copy any incremental files I need to (I have not tested with multiple incrementals, but I am assuming I will have to overlay the files, in sequence from oldest to newest)
Start Cassandra
On one of the nodes, run a "nodetool repair keyspace_name"
So my questions are:
Does the above backup and restore strategy seem valid? Are any steps inaccurate or anything missing?
Is there a way to do this without stopping Cassandra on EVERY node? For example, is there a way to restore the data on ONE node, then somehow make it "authoritative"? I tried this and, as expected, since the restored data is older, the data on the other nodes (which is newer) overwrites it when they sync up during repair.
Thank you!
There are two ways to restore Cassandra backups without restarting C*:
Copy the files into place, then run "nodetool refresh". This has the caveat that the rows will still be older than tombstones. So if you're trying to restore deleted data, it won't do what you want. It also only applies to the local server (you'll want to repair after)
Use "sstableloader". This will load data to all nodes. You'll need to make sure you have the sstables from a complete replica, which may mean loading the sstables from multiple nodes. Added bonus, this works even if the cluster size has changed. I'm not sure if ordering matters here (that is, I don't know if row timestamps are preserved through the load or if they're redefined during load)
