DataStax Cluster Storage on Amazon EC2 - Production - cassandra

I have a Datastax Enterprise cluster in production with the following configuration:
3 Hadoop Nodes
2 Cassandra Nodes
2 Solr Nodes
There are a few tables in Cassandra with a few million rows.
Every night I process a few million records using Pig.
All search on our website uses Solr.
Basically we are 100% based on DSE.
This setup runs on Amazon EC2, and all the instances are:
m3.xlarge
80 GB SSD
15 GB RAM
13 ECUs (4 cores x 3.25 units)
I want to add an extra 1 TB hard disk to each node and use it in the cluster.
How can I do that? Which config files do I need to change when I attach a new hard disk?

After attaching the new storage volume to the EC2 instance, edit the cassandra.yaml file and add the new storage location to the data_file_directories configuration option. (Cassandra supports multiple data directory entries and will spread the data across them.)
The config file location depends on your installation method: either /etc/dse/cassandra/cassandra.yaml or {install_location}/resources/cassandra/conf/cassandra.yaml.
After making the config change, DSE needs to be restarted on each node (a rolling restart works).
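For example, assuming the new 1 TB volume is mounted at /mnt/data2 (a made-up mount point; adjust the first entry to whatever your current data_file_directories value is), the relevant part of cassandra.yaml would look something like this:

    data_file_directories:
        - /var/lib/cassandra/data
        - /mnt/data2/cassandra/data

Make sure the user that runs DSE (typically cassandra) has ownership of the new directory before restarting.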
Reference: https://stackoverflow.com/a/23121664/9965

Related

Cassandra Cluster Migration Issues

Now:
Single Node
cassandra 3.11.3 + kairosdb 1.2
Two data storage paths
/data/cassandra/data/kairosdb 4T Old data
/data1/cassandra/data/kaiosdb 1.1T Now writing data
Target:
Three Node
cassandra 3.11.3 + kairosdb 1.2
One data storage path
/data/cassandra/data/kairosdb
In this case, how do I migrate the data in the two data directories on the single node to a three-node cluster, where each node has only one data directory?
I understand how to do this (and have practiced it) when migrating a single node to a three-node cluster, but only with a single data directory. For the case of two data directories being migrated down to one, I have searched the Internet for a long time but found no reference material.
Data directories are something the individual Cassandra node cares about, but the cluster doesn't.
Usually you'd want all nodes to share the same configuration, but for replication it really doesn't matter where the SSTables live on disk on each node.
So migrating here would be the same as you've practiced.
That said, the process I'd choose would be to add the new nodes as a second datacenter with the right replication, run a repair to get all the data in sync, and then decommission the original node.
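A rough sketch of that approach, assuming the keyspace is named kairosdb and using hypothetical datacenter names DC1 (the existing node) and DC2 (the three new nodes):

    # once the new nodes have joined as datacenter DC2
    cqlsh -e "ALTER KEYSPACE kairosdb WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 3};"

    # on each new node: stream the existing data over from the old datacenter
    nodetool rebuild -- DC1

    # on each new node: make sure everything is consistent
    nodetool repair -full kairosdb

    # on the old node, once DC2 holds all the data
    nodetool decommission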

Cassandra global snapshot

I am running a cluster with 3 nodes (EC2 instances) and replication factor 2. I execute a script from the first node which runs nodetool snapshot on all the nodes using the pssh (parallel-ssh) utility. But the snapshot data for each node gets stored on that node itself. Is there a way to get the snapshot data from all nodes onto the node from which I ran the script, so that my script can copy the data to S3 from a single place?
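For context, the kind of script described above might look roughly like this (host file, snapshot tag, keyspace, and bucket name are made up):

    # hosts.txt lists every node in the cluster
    pssh -h hosts.txt -i "nodetool snapshot -t nightly my_keyspace"
    # each node's snapshot lands under its own data directory, e.g.
    #   /var/lib/cassandra/data/my_keyspace/<table>/snapshots/nightly/
    # so each node has to upload its own files, for example:
    pssh -h hosts.txt -i "aws s3 sync /var/lib/cassandra/data s3://my-backups/\$(hostname) --exclude '*' --include '*/snapshots/nightly/*'"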
Also,
Suppose I have a 5-node cluster and I have snapshots from each node. Now I want to restore this data to a 10-node cluster and a 2-node cluster with different replication factors. Is the process below correct for the restore?
Copy snapshot data from all 5 nodes and merge all the files into a single folder.
Run the sstableloader command, passing all the IP addresses (10 or 2, respectively) and the single folder location. Will this properly split the data from 5 nodes across 10 or 2 nodes after the restore?
I strongly suggest using the Medusa tool (doc) for backup & restore of your Cassandra cluster(s) - it can back up data to cloud storage, and you can restore data to clusters even with a different topology.
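As a rough illustration only (subcommand and flag names depend on the Medusa version, so treat this as a sketch and check the Medusa docs), a cluster-wide backup and a restore onto a differently sized cluster look roughly like:

    # back up every node under a single backup name
    medusa backup-cluster --backup-name nightly_2019_05_01
    # restore that backup to another cluster; flag names may differ between versions
    medusa restore-cluster --backup-name nightly_2019_05_01 --seed-target node1.example.com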

Cassandra cluster bulk loader hangs during export

I am migrating a simple 4-node Cassandra cluster from one cloud provider to another. The number of nodes in both clouds is the same; however, the newer cluster is at version 3.11.0 and the older one is at 3.0.11. I am using sstableloader to stream data from one cluster to the other (the schema has been created on the new cluster separately). As per the release notes this should not be a problem.
However, for certain column families sstableloader reaches 100% progress but then hangs there for hours (hang time >> streaming time). The total data to stream from each node is below 500 GB. Any help on why this is happening and how to avoid it is appreciated.
Create a new node and add it to the existing cluster from the new cloud provider.
Flush tables from the memtable to SSTables on disk.
Remove one node from the old cloud provider. Repeat likewise for each node (a rough sketch follows below).
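A minimal sketch of that node-by-node move (the commands are standard nodetool operations; hostnames and ordering details are simplified):

    # on each old node, before streaming anything
    nodetool flush            # write memtables out to SSTables on disk

    # on a new node in the new cloud: point seeds in cassandra.yaml at the existing
    # cluster, start Cassandra, then confirm it finished bootstrapping
    nodetool status           # the new node should show as UN (Up/Normal)

    # on one old node at a time, once the cluster is healthy
    nodetool decommission     # streams that node's ranges to the remaining nodes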

Cassandra cluster on budget

I am learning Cassandra and want to run a cloud-based cluster. I don't care much about speed.
What I want to really test is the replication and recovery features.
I would be running tests like
taking nodes offline every once in a while
kill -9 cassandra
powering off server
manually corrupting sstables/commitlog (not sure if this is recoverable)
I am thinking of going for a 4 node cluster.
Each node will have the following config:
2 GB RAM
10 GB SSD
2 CPUs (Virtual)
Two nodes will be in a European data center and the other two will be in a North American data center.
I know 8 GB is the recommended minimum for Cassandra, but that config would be quite expensive.
If it helps, I can run one more VM on a dedicated box. This VM can have 16 GB RAM and 8 virtual CPUs. I could also run 4 VMs with 4 GB RAM each on this box. But I guess having 4 separate VMs in different data centers would make a more realistic setup and bring to the fore any issues that may arise from network problems, latencies, etc.
Is it okay to run Cassandra on machines with this config? Please share your thoughts.
Many people run multiple instances of Cassandra on modern laptops using ccm ( https://github.com/pcmanus/ccm ). If you just want to get an idea of what it does (create a 3-node cluster, add data, add a 4th node, create a snapshot, remove a node, add it back, restore the snapshot, etc.), using ccm on a PC may be 'good enough'.
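A minimal ccm session along those lines (cluster name and Cassandra version are just examples):

    ccm create budget_test -v 3.11.4 -n 3 -s   # create and start a local 3-node cluster
    ccm status                                 # list nodes and their state
    ccm node1 stop                             # simulate a node going offline
    ccm node1 start
    ccm add node4 -i 127.0.0.4 -j 7400 -b      # add a 4th node with bootstrap enabled
    ccm node4 start
    ccm remove                                 # tear the whole cluster down when finished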
Otherwise, you can certainly run with less than 1 GB of RAM, but it's not always fun. There have been some clusters on tiny hardware ( http://www.datastax.com/dev/blog/32-node-raspberry-pi-cassandra-cluster ). Depending on your budget, making a cluster of Raspberry Pis may be as cost-effective as your 2 VM cluster.

Priam backup automatic restore

I have a Cassandra cluster managed by Priam, with 3 nodes. I use ephemeral disks to store my Cassandra data, so when I start 1 node, the Cassandra data dir is empty.
I have Priam properly configured and I can see backups are saved in Amazon S3. Suppose a node goes down and then I start another node. Will Priam know how to automatically restore the backup from S3 when the node comes up again? The Cassandra data dir will start out empty, so I am assuming Priam would give the new node the same token as the old one and it would restore the data... Right?
Yes. I have been running standalone Cassandra on EC2, small Cassandra clusters on Mesos on EC2, and larger DataStax Enterprise clusters (with Cassandra) on EC2.
I have been using the Priam 3.x branch.
On restore, it calculates the initial_token, updates the cassandra.yaml file, restores the snapshot and incremental backup files, and restarts Cassandra.
According to Priam/Netflix conventions, if you have a 3-node cluster with Cassandra, your nodes should be named some_thing-other-things. Each node should be part of an Auto Scaling group called some_thing. Each node should also use a Security Group named some_thing.
Create a 3-node dev cluster and test your backups and restores with data that you can easily recreate and don't care about too much. Get used to managing the Auto Scaling groups and Priam. Then try it on test clusters with data that you care about.
