How does Cassandra split keyspace data when multiple directories are configured?

I have configured three separate data directories in cassandra.yaml file as given below:
data_file_directories:
- E:/Cassandra/data/var/lib/cassandra/data
- K:/Cassandra/data/var/lib/cassandra/data
When I create a keyspace and insert data, the keyspace is created in both directories and the data is scattered across them. How does Cassandra split the data between multiple directories, and what is the rule behind it?

You are using the JBOD feature of Cassandra when you add multiple entries under data_file_directories. Data is spread evenly over the configured drives proportionate to their available space.
This also lets you take advantage of the disk_failure_policy setting. You can read about the details here:
http://www.datastax.com/dev/blog/handling-disk-failures-in-cassandra-1-2
In short, you can configure Cassandra to keep going, doing what it can if the disk becomes full or fails completely. This has advantages over RAID0 (where you would effectively have the same capacity as JBOD) in that you do not have to replace the whole data set from backup (or full repair) but just run a repair for the missing data. On the other hand, RAID0 provides higher throughput (depending how well you know how to tune RAID arrays to match filesystem and drive geometry).
If you have the resources for fault-tolerant/more performant RAID setup (like RAID10 for example), you may want to just use a single directory for simplicity. Most deployments are starting to lean towards the density route, using JBOD rather than systems-level tolerance though.
You can read about the thought process behind the development of this feature here:
https://issues.apache.org/jira/browse/CASSANDRA-4292

I can partly guess how the keyspace is split between multiple data directories: based on the available space and the load on each directory, SSTables of the same column family are written to different data directories.
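The behavior described in this thread can be pictured with a small sketch. It is not Cassandra's actual allocation code; it only assumes, as the answer above states, that each new SSTable goes to a directory chosen in proportion to that directory's free space. The function name pick_directory and the free-space figures are hypothetical.

```python
import random

def pick_directory(free_bytes, rng=random):
    """Pick a data directory with probability proportional to its free space.

    free_bytes maps directory path -> free bytes (assumed input; a real
    caller could obtain these with shutil.disk_usage(path).free).
    """
    candidates = [(d, f) for d, f in free_bytes.items() if f > 0]
    if not candidates:
        raise RuntimeError("no data directory has free space")
    dirs, weights = zip(*candidates)
    return rng.choices(dirs, weights=weights, k=1)[0]

# Hypothetical free space on the two drives from the question.
free = {
    "E:/Cassandra/data/var/lib/cassandra/data": 300 * 10**9,
    "K:/Cassandra/data/var/lib/cassandra/data": 100 * 10**9,
}
rng = random.Random(0)
counts = {d: 0 for d in free}
for _ in range(1000):
    counts[pick_directory(free, rng)] += 1
print(counts)  # the E: drive, with three times the free space, is picked more often
```

With weights like these, files of the same column family end up in both directories, which matches what the question observed.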

Related

How to calculate "Concurrent_reads" parameter in cassandra DB

How do I calculate and set the concurrent_reads parameter in the cassandra.yaml file?
The recommended value for concurrent_reads is given as 16 * number of drives.
My question is: what exactly is this "number of drives", and how do I calculate it?
Assuming system has 8 cores, 32 GB RAM and 1TB of hard disk.
One way to configure Apache Cassandra's data directories was to use multiple data drives. In the bare metal world of a few years ago, this usually meant multiple physical disks. They would be configured in cassandra.yaml like this:
data_file_directories:
- /data01
- /data02
- /data03
- /data04
This assumes that the Cassandra instance has four physical drives, attached on the data0[1-4] mount points. Cassandra would then treat these directories in a JBOD (just a bunch of disks) fashion, spreading data evenly across them.
In this case, computing concurrent_reads with your formula above would be 16 x 4, as there are four drives. Given the emergence of solid state drives, the use of multiple, physical (or logical) disks isn't done much today (in my experience).
tl;dr;
If you're unsure of how many drives you have, check your mount points (df -h, /etc/fstab, etc...). Or you will probably be fine assuming one, and adjusting that calculation based on your available compute resources.
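The rule of thumb from the question (16 x number of drives) is simple enough to write down as a helper. This is only a sketch of that arithmetic; the function name and the per_drive parameter are made up for illustration.

```python
def concurrent_reads(num_drives, per_drive=16):
    """Rule of thumb from the question: 16 concurrent reads per data drive."""
    if num_drives < 1:
        raise ValueError("num_drives must be at least 1")
    return per_drive * num_drives

print(concurrent_reads(4))  # 64, the four-drive JBOD example above
print(concurrent_reads(1))  # 16, when assuming a single drive
```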

Hazelcast - PartitionGroup + Multiple Backups

Assuming 4 nodes split across 2 data centers (DC1-1, DC1-2, DC2-1, DC2-2).
Using partition groups and the default backup count of 1, the documentation and other questions/articles are pretty clear about how data is distributed assuming well distributed data - 25% per node as primary, all the primary data in DC1-1/DC1-2 will be backed up on either DC2-1/DC2-2 and vice versa.
It is not clear what the expected behavior is in the same situation if we increase the backup count to 2. Assume entry #1 currently has its primary on DC1-1. Would both backups be forced onto the two DC2 nodes? Is there a way to get one backup in each partition group (i.e. primary on DC1-1, one backup on DC1-2, and one backup on either DC2-1 or DC2-2)?
Thanks
First of all, we do not recommend splitting a single cluster over multiple data centers. There are possible exceptions, but keep in mind that latency between data centers matters because the data is partitioned across them.
To your question:
If you have just two partition groups defined, there is no way to create more than one backup. Think of a normal cluster as having one node per partition group; with pG partition groups you can therefore have at most pG-1 backups. If you change the configuration to 2 partition groups, you can only have one backup.
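The pG-1 limit from this answer can be expressed as a tiny calculation. This is a sketch of the rule as stated here, not Hazelcast code; the function name is hypothetical.

```python
def effective_backups(requested_backups, partition_groups):
    """Backups that can actually be placed: at most (partition groups - 1)."""
    if partition_groups < 1:
        raise ValueError("need at least one partition group")
    return min(requested_backups, partition_groups - 1)

print(effective_backups(2, 2))  # 1: two groups (DC1, DC2) leave room for one backup
print(effective_backups(2, 4))  # 2: one group per node leaves room for both backups
```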

Change replication factor of selected objects

Is there any cloud storage system (e.g. Cassandra, Hazelcast, OpenStack Swift) where we can change the replication factor of selected objects? For instance, let's say we have identified hotspot objects in the system; can we increase their replication factor as a solution?
Thanks
In Cassandra the replication factor is controlled based on keyspaces. So you first define a keyspace by specifying the replication factor the keyspace should have in each of your data centers. Then within a keyspace, you create database tables, and those tables are replicated according to the keyspace they are defined in. Objects are then stored in rows in a table using a primary key.
You can change the replication factor for a keyspace at any time by using the "alter keyspace" CQL command. To update the cluster to use the new replication factor, you would then run "nodetool repair" for each node (most installations run this periodically anyway for anti-entropy).
Then, if you use for example the Cassandra Java driver, you can specify the load balancing policy to use when accessing the cluster, such as round robin or a token aware policy. So if you have multiple replicas of the table holding the objects, the load of accessing an object could be spread round robin over just the nodes that have a copy of the row you are accessing. If you are using a read consistency level of ONE, this would spread out the read load.
So the granularity of this is not at the object level, but at the table level. If you had all your objects stored in one table, then changing the replication factor would change it for all objects in that table and not just one. You could have multiple keyspaces with different replication factors and keep high demand objects in a keyspace with a high RF, and less frequently accessed objects in a keyspace with a low RF.
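As a sketch of the "alter keyspace" step mentioned above, the helper below only builds the CQL text; it does not talk to a cluster. The keyspace name hot_objects, the data center names and the replication factors are hypothetical, and NetworkTopologyStrategy is assumed because the answer talks about per-data-center replication.

```python
def alter_keyspace_rf(keyspace, rf_per_dc):
    """Build an ALTER KEYSPACE statement setting a replication factor per data center."""
    opts = ["'class': 'NetworkTopologyStrategy'"]
    opts += ["'%s': %d" % (dc, rf) for dc, rf in sorted(rf_per_dc.items())]
    return "ALTER KEYSPACE %s WITH replication = {%s};" % (keyspace, ", ".join(opts))

# Hypothetical keyspace holding the frequently read objects.
stmt = alter_keyspace_rf("hot_objects", {"DC1": 5, "DC2": 3})
print(stmt)
```

After running the resulting statement (for example in cqlsh), "nodetool repair" would still be needed on each node, as described above.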
Another way you could reduce the hot spot for an object in Cassandra is to make additional copies of it by inserting it into additional rows of a table. The rows are accessed on nodes by the compound partition key, so one field of the partition key could be a "copy_number" value, and when you go to read the object, you randomly set a copy_number value (from 0 to the number of copy rows you have) so that the load of reading the object will likely hit a different node for each read (since rows are hashed across the cluster based on the partition key). This approach would give you more granularity at the object level compared to changing the replication factor for the whole table, at the cost of more programming work to manage randomly reading different rows.
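The copy_number idea can be sketched without a cluster. In the snippet below a plain dict stands in for the table, keyed by the compound partition key (object_id, copy_number); the function names and the sample object are hypothetical, and a real implementation would issue INSERT and SELECT statements through a driver instead.

```python
import random

def write_copies(table, object_id, value, num_copies):
    """Store the same value under num_copies different partition keys."""
    for copy_number in range(num_copies):
        table[(object_id, copy_number)] = value

def read_any_copy(table, object_id, num_copies, rng=random):
    """Read one randomly chosen copy so reads spread over partitions."""
    copy_number = rng.randrange(num_copies)  # 0 .. num_copies - 1
    return table[(object_id, copy_number)]

table = {}  # stand-in for the Cassandra table
write_copies(table, "video-42", b"payload", num_copies=4)
print(read_any_copy(table, "video-42", 4))  # b'payload'
```

The reader has to know num_copies for each object, which is the extra bookkeeping this approach costs.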
In Infinispan, you can also set number of owners (replicas) on each cache (equivalent to Hazelcast's map or Cassandra's table), but not for one specific entry. Since the routing information (aka consistent hash table) does not contain all keys but splits the hashCode() 32-bit range to variable amount of segments, and then specifies the distribution only for these segments, there's no way to specify the number of replicas per entry.
Theoretically, with specially forged keys and custom consistent hash table factory, you could achieve something similar even in one cache (certain sorts of keys would be replicated different amount of times), but that would require coding with deep understanding of the system.
Anyway, the reader would have to know the number of replicas in advance as this would be part of the routing information (cache in simple case, special keys as described above), therefore, it's not really practical unless the reader can know that.
I guess you want to use the replication factor for the sake of speeding up reads.
The regular Map (IMap) implementation uses a master/slave(s) setup, so all reads go through the master. But there is a special setting available that also allows reads from backups. So if you have a 10 node cluster and a backup count of 5, there will be 6 members in total that have the information stored. 5 members in the cluster will hit the master, and 5 members in the cluster will hit the backup (since they have the backup locally available).
There is also a fully replicated map available, where every item is sent to every machine. So in a 10 node cluster, all reads will be local since every machine has the same data.
In case of the IMap, we don't provide control on the number of backups on the key/value level. So the whole map is configured with a certain backup-count.

Cassandra multiple disk per node setup

Intro
I have a cassandra 1.2 cluster, all the nodes have SSDs. Now I want to add more disks to the existing nodes, but I want to be able to choose which tables are stored on different disks.
Problem
For example, node 1 will have 3 SSDs and 1 regular disk drive and I want all the column families except 1 (let's call it "discord" table) to be stored on the SSDs only, the final table "discord" needs to be stored on the regular disk.
According to the documentation this should be possible; however, the only way of doing it that I can see is:
Setting up Cassandra to use multiple data_file_directories in cassandra.yaml.
Creating the tables.
Creating a link from the data directory on each SSD to the directory on the hard disk where I want to store the column family.
Question
Is this the only way of doing it? Or there is a simpler way of configuring a node to work in this way?
You can set multiple directories using the data_file_directories property, but the data is distributed over those directories internally by Cassandra. You cannot decide which keyspace or column family goes to which directory.
So symbolic links are the way to go, in my opinion.
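The linking step from the question could be scripted roughly as follows. This is a sketch only: it assumes a <data_dir>/<keyspace>/<table> directory layout, that Cassandra is stopped while the links are created, and that any existing files for the table have already been moved to the target. The helper name, the keyspace name and the paths are hypothetical, and the demo runs in a temporary directory instead of on real mount points.

```python
import os
import tempfile

def link_table_dir(data_dir, keyspace, table, target_dir):
    """Make <data_dir>/<keyspace>/<table> a symlink pointing at target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    keyspace_dir = os.path.join(data_dir, keyspace)
    os.makedirs(keyspace_dir, exist_ok=True)
    link_path = os.path.join(keyspace_dir, table)
    if os.path.lexists(link_path):
        raise FileExistsError(link_path + " already exists; move its files first")
    os.symlink(target_dir, link_path, target_is_directory=True)
    return link_path

# Demo: three stand-in SSD data directories, one stand-in hard disk directory.
root = tempfile.mkdtemp()
ssd_dirs = [os.path.join(root, "ssd%d" % i) for i in (1, 2, 3)]
hdd_dir = os.path.join(root, "hdd", "discord")
links = [link_table_dir(d, "my_keyspace", "discord", hdd_dir) for d in ssd_dirs]
print(all(os.path.realpath(p) == os.path.realpath(hdd_dir) for p in links))
```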

moving Cassandra snapshots to a different disk/server/datacenter

I have a Cassandra 1.2.6 cluster running in datacenter A. Each node has a solid state drive with somewhat limited space (approximately 50% of the disk space is free).
Now I need to implement somehow a way of having automatic backups of each node. Ideally I want to have a way of moving all of the cluster's datafiles to a different disk (standard cheaper disks), or even to a different server in the same datacenter A and possibly moving all the data once in a while to a datacenter B in a different location.
From what I've read I can use snapshots on each node to get the files to copy using whatever tool I want and in this case I have the option to move the data to a different disk/server/datacenter.
My question is: since each of my nodes is about 50% full, will taking a snapshot consume all of the remaining space, or will the hard links consume far less space than I anticipate? Also, is there a better way of doing this, perhaps with an existing tool, or does everything have to be custom made for this type of backup in Cassandra?
Thanks in advance!
A hard link just creates a new directory entry for the same file (http://en.wikipedia.org/wiki/Hard_link). So a snapshot takes up effectively zero space, but you'll want to clean it up after you're done copying it off to whatever your archive is, because when the "original" sstable is deleted (typically post-compaction), space won't be reclaimed as long as the snapshot reference is still there.
My impression is that tablesnap is the most popular tool for automating backups to S3. It also supports Cassandra incremental backups. If you want more control over where you're backing up to, DataStax OpsCenter supports running a custom script when it takes snapshots.
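The hard-link behavior described in this answer can be checked with a small experiment. The sketch below does not use Cassandra at all; it creates a stand-in file in a temporary directory (the file and directory names are made up) and hard-links it the way a snapshot would, assuming a filesystem that supports hard links.

```python
import os
import tempfile

root = tempfile.mkdtemp()
sstable = os.path.join(root, "ks-table-ic-1-Data.db")  # hypothetical file name
snapshot_dir = os.path.join(root, "snapshots", "backup1")
os.makedirs(snapshot_dir)

with open(sstable, "wb") as f:
    f.write(b"x" * 1024)

snap = os.path.join(snapshot_dir, os.path.basename(sstable))
os.link(sstable, snap)  # hard link: a second directory entry, no data copied

same_inode = os.stat(sstable).st_ino == os.stat(snap).st_ino
link_count = os.stat(snap).st_nlink
print(same_inode, link_count)  # True 2

os.remove(sstable)  # stands in for the original being deleted after compaction
remaining = os.stat(snap).st_nlink
size = os.path.getsize(snap)
print(remaining, size)  # 1 1024
```

The second print shows why old snapshots should be cleaned up: the data stays on disk for as long as the snapshot's link exists.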
