I just setup a fresh windows server with a fresh datastax installation including cassandra 1.2 and opscenter 2.1.3. I've tried finding solutions to these questions on cassandra wikis and datastax website, but I can only find unix specific information or datastax API information.
Cassandra is defaulted to using C: drive (I was never asked to select a drive for cassandra during install).
In the same cassandra instance, can I have keyspaces on separate
disks?
If not, how do I migrate the existing keyspace to the new
drive? (just reconfiguring cassandra.yaml to use a new directory
would lose my opscenter data and may even break opscenter).
If yes, how can I create a new keyspace on a separate drive? cassandra.yaml
seems to only have configuration options for a single store location.
Should I be creating a new cluster to store my data in? If I start
adding new nodes to the default cluster, that will mean the datastax
opscenter data will be getting replicated - that seems like a bad
idea.
If there is good documentation on this somewhere, please point me there.
Thanks,
Adam
You cannot get cassandra to split the keyspaces and store them in different directories. They are all stored under a common data directory that is specified in the cassandra.yaml file.
However, you can set this up and use NTFS to mount different drives under the data directory on your server but this will not be simple or expandable.
If you want to move where the data is stored on cassandra, then stop the cassandra daemon/service, change the cassandra.yaml file to store the data at a new location, then copy/move the entirety of the data directory to this new location. THEN start cassandra back up and it will work fine with the data in the new location. I have done this quite a few times now and cassandra comes back up without incident and no lost data (if you do not move the data, then it will lose it all and recreate the directory structure under the new location).
Data getting replicated is not a bad thing - it is what cassandra was designed for. I don't know what replication factor opscenter uses, but it does not store a massive amount of data so replication is not a problem.
Related
I am working on a cassandra backup and recovery strategy for our cassandra system and am trying to understand how the backup and sstable recovery works in cassandra. Here are of my observations and related questions (my need is to setup a standby/backup cluster which would become active if the primary cluster goes down.. so I want to keep them in sync in terms of data, so I want to take periodic backups at my active cluster and recover to the standby cluster)
Took a snapshot backup. Dropped a table in cassandra. Stopped cassandra, recovered from the snapshot backup (copied the sstables to the data/ folder), started cassandra. Ran cqlsh on the node, and I still do not see the table created. Should this work? Am I missing any step ?
In the above scenario, I then tried to re-setup the schema (I take backup of the schema in the snapshot) using the cql commant source . This created the table for me. However it creates a "new version" of table for me. When I recover the snapshot has the older version (different uuid labelled folders for table). After recovery, I still see no data in the table. Possibly because I created a new table?
I was finally able to recover data after running nodetool repair and using sstableloader to restore table data from another node in the cluster.
My question is
a. what is the right way to setup a new (blank- no schema) cluster from a snapshot? How do you setup the schema and recover data?
b. how do you restore a cluster from a backup with table alterations. How do you bring a cluster running an older version of schema to a newer version of schema when recovering from a backup (snapshot or incremental)?
(NOTE: cassandra newbie here)
So if you want to restore a snapshot, you need to copy the snapshot files back to the sstable directory and then run: nodetool refresh. You can read:
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/operations/opsBackupSnapshotRestore.html
for more information. Also, there are 3rd party tools that can back up your data and then restore it as it was at the time of the backup. We use a tool: Cohesity (formally Talena/Imanis). It has a lot of capabilities (refreshing A to B, restore/rename, etc.). There are other popular ones as well. All of them have costs associated with them.
Hopefully that helps?
-Jim
We have a regular backup of our cluster and we store schema and snapshot back up on aws s3 on daily basis.
Somehow we have lost all the data and while recovering the data from backup we are able to recover schema but while copying snapshots files to /var/lib/cassandra/data directory its not showing up the data in the tables.
After copying the data we have done nodetool refresh -- keyspace table but still nothing is working out.
could you please help on this ?
Im new at Apache Cassandra, but my first focus at this topic was the Backup.
If you want to restore from a Snapshot (on new node/cluster) you have to shut down Cassandra on any node and clear any existing data from these folders:
/var/lib/cassandra/data -> If you want to safe your System Keyspaces so delete only your Userkeyspaces folders
/var/lib/cassandra/commitlog
/var/lib/cassandra/hints
/var/lib/cassandra/saved_cashes
After this, you have to start Cassandra again (the whole Cluster). Create the Keyspace like the one you want to restore and the table you want to restore. In Your Snapshot folder you will find a schema.cql script for the creation of the table.
After Creating the Keyspaces an tables again, wait a moment (time depends on the ammount of nodes in your cluster and keypsaces you want to restore.)
Shut down the Cassandra Cluster again.
Copy the Files from the Snapshot folder to the new folders of the tables you want to restore. Do this on ALL NODES!
After copying the files, start the nodes one by one.
If all nodes are running, run the nodetool repair command.
If you try to check the data via CQLSH, so think of the CONSISTENCY LEVEL! (ALL/QUORUM)
Thats the way, wich work at my Cassandra cluster verry well.
The general steps to follow for restoring a snapshot is:
1.Shutdown Cassandra if still running.
2.Clear any existing data in commitlogs, data and saved caches directories
3.Copy snapshots to relevant data directories
4.Copy incremental backups to data directory (if incremental backups are enabled)
If required, set restore_point_in_time parameter in commitlog_archiving.properties to
restore point.
5.Start Cassandra.
6.Run repair
So try running repair after copying data.
I have a single-node Cassandra setup for my application. To reclaim disk space occupied by deleted records (tombstoned records), I triggered a nodetool compact for my keyspace. Unfortunately, this compaction process got interrupted. Now, when I try to re-start the service, it does not recognise the keyspace (from the data directory configured in cassandra.yaml) for which compaction was in progress when it got interrupted. Other keyspaces like system and system_traces are successfully initiated from the same data directory.
Has anybody encountered a similar issue before? Also, pointers to restore a keyspace only from data files would be of great help (for the lack of maintenance of snapshots).
PS: Upon analysing further it was found that an rm command on the cassandra data directory was issued but immediately cancelled. Most of the data seems to be in place, but there is a chance that the Data.db file of the system keyspace was lost. Is there a way to recover from this state?
Seems like you have corrupted your setup by deleting System keyspace files, hence Cassandra might not be checking the same at boot time.
Try this:
Download same version of cassandra again.
Create your keyspace & cf schemas
Move whatever old data is left to new data directory(cassandra will only load the non-corrupted data) -
sudo mv /data/cassandra_old/data/[keyspace]/[cf]-[md5-old]/* /data/cassandra_new/data/[keyspace]/[cf]-[md5-new]/
It should solve it if I understand the problem correctly.
Offsite backups for Cassandra seem like a challenging thing. You basically have to make yet another copy of ALL your data, including the copies of data that exist due to the replication factor. Snapshots make backups easy when you don't mind storing it on the same disk that your node already uses. I'm curious - in the event of a catastrophic failure of this disk, is it possible to recover the node using the nodes that the data was replicated to?
Yes, you can restore data on crashed node using a procedure in documentation - Replacing a dead node or dead seed node. It's for Cassandra 3.x, please pick your Cassandra version from a drop-down menu on the top of the page.
But please note that you still need to do backups if your data is valuable. If you using AWS you can use this project to backup Cassandra to S3 storage.
If you are looking for offsite or off-host backups, you can also look at opscenter from Datastax or Talena software (my company). Both provide you the ability to backup your database locally or to S3. As you may expect, you also have the ability to restore data in case of hardware failures, user errors or logical corruptions which the replicas will not protect you against.
Yes, it is possible. Just execute in terminal "nodetool repair" on the node with missed data. It can take a lot of time. Also I would recommend execute repair operation on each node every month to keep your data always replicated because cassandra does not repairs data automatically (for example after node(s) falling).
I am trying to backup the whole cluster consistently. What are different ways to backup and restore Cassandra cluster?
If you are using the DataStax Enterprise version, then the easiest way is to perform the backups and restore using OpsCenter.
If you are using the DataStax Community or open-sourced version of Cassandra, then use nodetool snapshot to create backups of tables and/or keyspaces.
Please bear in mind that SSTables are immutable, i.e. they never change once they are written to disk. So unlike RDBMS data files, SSTables are not updated.
To perform a snapshot cluster-wide, use SSH tools such as pssh to perform parallel snapshots on all nodes.
More information on the snapshot utility is available here.
There are several ways to restore from snapshots. One way is to re-load the data using the sstableloader tool where the data is read back into the cluster. Another way is by copying the SSTable directory from snapshot and running nodetool refresh. Finally, you can replace the existing data with the snapshot and restarting the node.
More information on backups and restores are available here.