How to take a dump of a keyspace in Cassandra?

I need to take a dump of a keyspace from the server and restore that dump into my local Cassandra installation.
I know how to do this in MySQL, but how is it done in a NoSQL database?
I have read that nodetool, snapshots, and the CSV format can achieve this, but I was unable to get it working.

You can do this with nodetool. For good reference documentation, take a look here: http://www.datastax.com/docs/1.1/backup_restore
Roughly you need to perform the following steps:
1. Take a snapshot of the keyspace on the server using: nodetool snapshot <keyspace-name>. This stores a snapshot for each table of the keyspace.
2. Copy the snapshots to your local machine. Do this for each table of the keyspace: they live under <cassandra-dir>/data/<keyspace-name>/<table-name>/snapshots/ (look for the latest snapshot — when you take a snapshot, nodetool prints its name/ID).
3. On your local server, before placing the server's snapshots: stop Cassandra, delete the contents of each table directory (<cassandra-dir>/data/<keyspace>/<table-name>/), and then place the server's snapshot files directly in each respective table directory (in <cassandra-dir>/data/<keyspace>/<table-name>/ itself, not in its snapshots/ subdirectory).
4. Restart the local server, and you should have the data from the server in your local server.
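Assuming a keyspace named mykeyspace and the default data directory, the steps above might look roughly like this (host names, table directory names, and the snapshot ID are illustrative, not actual values from this thread):

```shell
# On the remote server: take the snapshot (nodetool prints the snapshot name)
nodetool snapshot mykeyspace

# Copy each table's snapshot files to the local machine
# (repeat per table directory; <table>-<id> and the snapshot name will differ)
scp /var/lib/cassandra/data/mykeyspace/<table>-<id>/snapshots/<snapshot-name>/* \
    you@local-machine:/tmp/restore/<table>/

# On the local machine: stop Cassandra, clear the table directory,
# then drop the snapshot files directly into it
sudo service cassandra stop
rm /var/lib/cassandra/data/mykeyspace/<table>-<id>/*
cp /tmp/restore/<table>/* /var/lib/cassandra/data/mykeyspace/<table>-<id>/
sudo service cassandra start
```

These commands require a running Cassandra install on both machines; adjust paths and service names to your environment.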
HTH.

To do this with a snapshot:
Command for taking the snapshot:
<path-to-cassandra-bin>/nodetool -h <server-hostname-or-IP> -p <server-port> snapshot
This creates a snapshots directory under each table's data directory (by default under /var/lib/cassandra/data), containing a snapshot
of the server's current data which you can use as a dump for your local server.

While nodetool is the preferred way, if you don't have direct access to the underlying file structure I would recommend using something like: cassandradump
$ python cassandradump.py --keyspace system --export-file dump.cql
Exporting schema for keyspace system
Exporting schema for column family system.peers
Exporting data for column family system.peers
Exporting schema for column family system.range_xfers
Exporting data for column family system.range_xfers
Exporting schema for column family system.schema_columns
Exporting data for column family system.schema_columns
...
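To load the resulting dump back into your local cluster, cassandradump can also import the file it generated (the --import-file option is per the tool's own usage; verify against its README):

```shell
$ python cassandradump.py --import-file dump.cql
```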

Related

How do I restore a schema in Cassandra?

This is an example scenario; we wanted to understand whether it would be possible to recover from it, and also to understand the schema better.
Hypothetical scenario: just 1 node, Cassandra 3.11, with 1 keyspace and 1 table.
root@dd85fa9a3c41:/# cqlsh -k cycling -e "describe tables;"
rank_by_year_and_name
Now I reset my schema and restart Cassandra: (I have no nodes to replicate it again)
root@dd85fa9a3c41:/# nodetool resetlocalschema
With the new schema, I no longer "see" my keyspace+table:
root@dd85fa9a3c41:/# cqlsh -e "describe keyspaces;"
system_traces system_schema system_auth system system_distributed
I lost my original schema, which contained my keyspace and table. But they are still on disk:
root@dd85fa9a3c41:/# ls -l /var/lib/cassandra/data/cycling/
total 0
drwxr-xr-x 1 root root 14 Nov 22 11:32 rank_by_year_and_name-4eedbbf0
How could I restore that keyspace in this scenario? With sstableloader I could recreate the keyspace and table and import the data.
I would like to recover this schema and see my keyspace and table again.
I haven't found any way to do this without manually recreating the schema and importing with sstableloader.
Thank you if you help me!
On-disk data and schema are two different things in Cassandra.
To be able to restore a keyspace's schema, you first need to back it up using nodetool snapshot. It backs up the SSTables (as hard links) and creates a schema.cql file containing the schema.
See the official doc here: https://cassandra.apache.org/doc/3.11/cassandra/operating/backups.html
I realise it's a hypothetical scenario but running resetlocalschema on a single-node cluster is a bad idea. The node is supposed to drop its copy of the schema and request the latest copy from other nodes but in the case of a single-node cluster, there are no nodes to get the schema from.
You really shouldn't run resetlocalschema on a single-node cluster unless you're doing some specific test or edge case activity as discussed in CASSANDRA-5094.
Now to your question on how you would restore the schema, most enterprises have a copy of their schema usually in a Change Management system (or CI/Config Management System). Before updates can be made to the schema in production, it usually goes through testing, peer-review, staging/pre-production validation, and finally deployed to production through an approved Change Request (terms might differ between organisations but the net intent is the same).
Similarly when you perform regular backups, the nodetool snapshot command stores a copy of the schema together with the SSTable backups. In this example I posted in https://dba.stackexchange.com/questions/316520/, you can see that the snapshots/ folder contains both a manifest.json (inventory of SSTables included in the snapshot) and a schema.cql (the schema at the time of the snapshot):
data/
  community/
    users-6140f420a4a411ea9212efde68e7dd4b/
      snapshots/
        1591083719993/
          manifest.json
          mc-1-big-CompressionInfo.db
          mc-1-big-Data.db
          mc-1-big-Digest.crc32
          mc-1-big-Filter.db
          mc-1-big-Index.db
          mc-1-big-Statistics.db
          mc-1-big-Summary.db
          mc-1-big-TOC.txt
          schema.cql
From the above you should be able to see that you have two options available:
recreate the schema from a copy that's been submitted/peer-reviewed in your Change Management System, or
recreate the schema from the snapshot.
The choice depends on what you're trying to achieve. Cheers!
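As a sketch of the second option, using the snapshot layout above: recreate the table from schema.cql, then put the snapshot's SSTables back and tell Cassandra to pick them up (the snapshot's schema.cql holds the table's CREATE statement, so the keyspace may need to exist first; the new table directory ID is a placeholder):

```shell
# Recreate the table from the snapshot's schema file
cqlsh -f /var/lib/cassandra/data/community/users-6140f420a4a411ea9212efde68e7dd4b/snapshots/1591083719993/schema.cql

# Copy the snapshot's SSTables into the newly created table directory,
# then load them without a restart
cp /var/lib/cassandra/data/community/users-6140f420a4a411ea9212efde68e7dd4b/snapshots/1591083719993/mc-1-big-* \
   /var/lib/cassandra/data/community/users-<new-id>/
nodetool refresh community users
```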

what are the difference between the data back up using nodetool and cqlsh copy command?

Currently we have two options for backing up the data in the tables of a Cassandra keyspace. We can either use nodetool commands or use the COPY command from the cqlsh terminal.
1) What are the differences between these commands?
2) Which one is most appropriate?
3) Also, if we are using nodetool to take a backup, we would generally flush the data from memtables to SSTables before issuing the nodetool snapshot command. So my question is: should we employ the same technique of flushing the data if we use the cqlsh COPY command?
Any help is appreciated.
Thanks very much.
GREAT question!
1) What are the differences between these commands ?
Running a nodetool snapshot creates hard links to the SSTable files of the requested keyspace. It's the same as running this from the (Linux) command line:
ln {source} {link}
A cqlsh COPY is essentially the same as doing a SELECT * FROM on a table. It'll create a text file with the table's data in whichever format you have specified.
In terms of their difference from a backup context, a file created using cqlsh COPY will contain data from all nodes. Whereas nodetool snapshot needs to be run on each node in the cluster. In clusters where the number of nodes is greater than the replication factor, each snapshot will only be valid for the node which it was taken on.
2) Which one is most appropriate ?
It depends on what you're trying to do. If you simply need backups for a node/cluster, then nodetool snapshot is the way to go. If you're trying to export/import data into a new table or cluster, then COPY is the better approach.
Also worth noting, cqlsh COPY takes a while to run (depending on the amount of data in a table), and can be subject to timeouts if not properly configured. nodetool snapshot is nigh instantaneous; although the process of compressing and SCPing snapshot files to an off-cluster instance will take some time.
3) Should we employ the same technique of flushing the data if we use the cqlsh copy command ?
No, that's not necessary. As cqlsh COPY works just like a SELECT, it will follow the normal Cassandra read path, which will check structures both in RAM and on-disk.
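For example, a round trip with cqlsh COPY might look like this (the keyspace/table names are borrowed from earlier in this thread, and the target host is an assumption):

```shell
# Export the table to CSV (follows the normal read path; no flush needed)
cqlsh -e "COPY cycling.rank_by_year_and_name TO '/tmp/rank.csv' WITH HEADER = TRUE;"

# Import it into another cluster's table with the same schema
cqlsh other-host -e "COPY cycling.rank_by_year_and_name FROM '/tmp/rank.csv' WITH HEADER = TRUE;"
```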
nodetool snapshot is a good approach for any amount of data, and it creates hard links within seconds. The COPY command takes much longer, depending on the size of the data and the cluster. For small data sets and testing you may use COPY, but for production nodetool snapshot is recommended.

nodetool snapshot takes schema snapshot (backup) too?

Cassandra doc mentions that "nodetool snapshot" command takes snapshot of table data. However, I am also able to see schema.cql and manifest.json file in my snapshot directory where all snapshot files are generated.
Is this expected behavior? Also can I use this schema.cql file to restore the schema if needed?
My cassandra version
cqlsh> show version
[cqlsh 5.0.1 | Cassandra 3.0.9 | CQL spec 3.4.0 | Native protocol v4]
>nodetool version
ReleaseVersion: 3.0.9
EDIT:
Is it mandatory to use the cql file from the snapshot while restoring data? Suppose I have the CREATE TABLE cql stored somewhere else. Can I use that?
I performed some tests. When I re-created the table using the cql from the snapshot, the ID in the table directory name remained the same ("employee-42a71380966111e8870f97a01282a56a"). However, when I re-created the table using my original cql, the ID changed. Can this be a problem, and is that why we should use the cql from the snapshot?
Note: when I restored the data from the snapshot, it loaded fine in both of the above cases.
This cql file is per table. Can we get cql from the snapshot to create the keyspace?
Does the cql file get generated only for user-defined tables? I can't see a cql file being generated for system tables.
Yes, these files are necessary for restoring this particular table. And schema.cql captures the structure of the table at the time of the snapshot, because you need to restore a snapshot into a table with the same structure.
You can find more detailed description in the DataStax documentation.
Update after addition of more questions:
The presence of the schema in the snapshot makes life easier: the schema often evolves, and you can use a non-snapshot schema only if you can guarantee that it matches the data in the snapshot;
nodetool snapshot generates only the tables' schemas;
it's better not to mess with the system tables...
Here is detailed knowledge base article from DataStax support about backup/restore.
The doc link you have given is for Apache Cassandra, while the answer given refers to DataStax. I have taken snapshots and restored them in Apache Cassandra 2.0.4; it doesn't take any schema backup. All schemas need to be copied separately and created manually in the new cluster.

how to create solr index backup and restore?

We are creating snapshots of all Cassandra keyspaces, but we also need to create a backup of the Solr index, which contains a huge amount of data.
Here is the DataStax link for creating a backup.
I tried the following command:
$ nodetool -h localhost rebuild_index ks cf ks.cf
which works fine for small data sets but takes a long time for huge amounts of data.
I also tried the "Backup Solr Indexes" section in the DataStax doc
and ran:
$backup -d /var/lib/cassandra/data/solr.data -u root -v
and found this:
backup: Unrecognized or ambiguous switch '-d'; type 'backup help interactive' for detailed help.
which means this backup tool is not for the Solr index. Where can we find a suitable backup tool?
Could someone suggest me how to create the backup and restore for solr index?
Assuming you'll be creating backups intended to restore a cluster with the same token layout, and you can make your backups in a rolling fashion, something like the following may at least be a starting point:
For each node...
1.) nodetool drain the node to make sure your Solr cores are in sync with their backing Cassandra tables. (drain forces a memtable flush, which forces a Solr hard commit.)
2.) Shut down the node.
3.) Manually back up your data directories (.../solr.data for your index).
4.) Start the node again.
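A rolling-backup sketch of those steps for one node (the service name and backup destination are assumptions; adjust for your install):

```shell
# 1) Sync Solr cores with their backing tables (drain flushes memtables,
#    which forces a Solr hard commit)
nodetool drain

# 2) Shut down the node
sudo service dse stop

# 3) Back up the data directories (this includes .../solr.data, the index)
tar czf /backups/node1-$(date +%F).tar.gz /var/lib/cassandra/data

# 4) Start the node again
sudo service dse start
```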

Cassandra Database Problem

I am using the Cassandra database for a large-scale application, and I am new to it. I have a database schema for a particular keyspace, for which I created columns using the Cassandra Command Line Interface (CLI). Now, when I copied a dataset into the folder /var/lib/cassandra/data/, I was not able to access the values using the key of a particular column; I get a message that zero rows are present. But the files are there, with the extensions XXXX-Data.db, XXXX-Filter.db, and XXXX-Index.db. Can anyone tell me how to access the columns of the existing datasets?
(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart if you do any manual surgery like that.
(b) if you didn't also copy the schema definition it will ignore data files for unknown column families.
For what you are trying to achieve, it is probably better to export and import your SSTables.
You should have a look at bin/sstable2json and bin/json2sstable.
Documentation is there (near the end of the page): Cassandra Operations
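Usage for those tools looks roughly like this (the SSTable file name is illustrative; sstable2json/json2sstable shipped with older Cassandra versions such as the 0.7–2.x line):

```shell
# Export one SSTable to JSON
bin/sstable2json /var/lib/cassandra/data/Keyspace1/Standard1-f-1-Data.db > Standard1.json

# Re-import the JSON into an SSTable for the same column family
bin/json2sstable -K Keyspace1 -c Standard1 Standard1.json \
    /var/lib/cassandra/data/Keyspace1/Standard1-f-1-Data.db
```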
