DynamoDB and Cassandra partitioning strategy

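In the Dynamo paper, the authors introduced three different partitioning strategies:
Strategy 1: T random tokens per node and partition by token value
Strategy 2: T random tokens per node and equal-sized partitions
Strategy 3: Q/S tokens per node and equal-sized partitions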
It seems DynamoDB has evolved from strategy 1 to strategy 3. I have a few questions related to strategy 3:
Refer to:
Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding random accesses needed to locate specific items). This simplifies the process of bootstrapping and recovery.
How is it managed at a low level? One node can have a few partitions assigned to it. Is each partition handled separately inside the storage engine? For example, does each partition have a separate set of (memtable + SSTables), and they compact at their own paces? This seems to introduce complexity to the system and make it hard to debug if the compaction processes go wild.
It seems the partitioning granularity is fixed beforehand. Is there any way to partition further after the initial stage? For example, if a-c is one partition, and later on prefix b is hot and becomes a noisy neighbor to prefixes a and c, is there a way to isolate b to another node? How do we handle this situation in DynamoDB?
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1. Am I wrong?

Trying to answer each question in turn:
One node can have a few partitions assigned to it.
Each node will have one or more token ranges assigned during bootstrapping. Depending on the partitioner, tokens fall in the numeric range -2^63 to 2^63 - 1 for Murmur3Partitioner, or 0 to 2^127 - 1 for RandomPartitioner.
Each token here can contain a partition (but might not), so while you are thinking of it as the node owning partitions, strictly speaking it is owning token ranges.
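To make the idea of a node owning token ranges rather than partitions concrete, here is a minimal sketch. The three-node ring, the node names, and toy_token are all made up for illustration; toy_token uses MD5 as a stand-in and is not the real Murmur3 hash.

```python
import bisect
import hashlib

# Token bounds of the two common partitioners.
MURMUR3_MIN, MURMUR3_MAX = -2**63, 2**63 - 1
RANDOM_MIN, RANDOM_MAX = 0, 2**127 - 1

def toy_token(partition_key):
    """Stand-in hash mapped into the Murmur3 token range (NOT real Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    unsigned = int.from_bytes(digest[:8], "big")  # 0 .. 2^64 - 1
    return unsigned - 2**63                       # shift into -2^63 .. 2^63 - 1

def owner(ring, token):
    """Return the node holding the first ring token >= `token`, wrapping around."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token) % len(ring)
    return ring[idx][1]

# Hypothetical ring with one token per node, sorted by token.
ring = [(-2**62, "node-a"), (0, "node-b"), (2**62, "node-c")]

for key in ("alice", "bob", "carol"):
    token = toy_token(key)
    print(key, token, owner(ring, token))
```

A token range may hold many partitions or none; the node is responsible for whatever keys happen to hash into its range.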
Is each partition handled separately inside the storage engine?
This question doesn't really follow - an SSTable can contain 1 or more partitions. A partition can be contained in 1 or more SSTables, i.e. a partition can span SSTables.
For example, does each partition have a separate set of (memtable + SSTables), and they compact at their own paces?
No, there will be a memtable for the database table, and this is flushed to create the SSTables. The compaction of the multiple SSTables is determined by the compaction strategy setting, with quite different behaviours and advantages / disadvantages for each, depending on the usage scenario. One size does not fit all. Again, each SSTable can contain multiple partitions, and a partition can appear in more than 1 SSTable.
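As a rough illustration of the point above, the following toy model keeps one memtable per table and flushes it into immutable SSTables. ToyTable and its flush threshold are invented for this sketch; real Cassandra merges by write timestamp and flushes on memory pressure, which is simplified away here.

```python
from collections import defaultdict

class ToyTable:
    """Toy model: one memtable per table, flushed into immutable SSTables."""

    def __init__(self, flush_threshold=2):
        self.memtable = defaultdict(dict)  # partition key -> {clustering key: value}
        self.sstables = []                 # each SSTable: {partition key: {ck: value}}
        self.flush_threshold = flush_threshold
        self._pending = 0

    def write(self, pk, ck, value):
        self.memtable[pk][ck] = value
        self._pending += 1
        if self._pending >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memtable:
            self.sstables.append({pk: dict(rows) for pk, rows in self.memtable.items()})
        self.memtable = defaultdict(dict)
        self._pending = 0

    def read(self, pk):
        merged = {}
        for sstable in self.sstables:      # oldest first, newer values overwrite
            merged.update(sstable.get(pk, {}))
        merged.update(self.memtable.get(pk, {}))
        return merged

table = ToyTable(flush_threshold=2)
table.write("a", 1, "x")
table.write("b", 1, "y")  # flush: SSTable 0 holds partitions a and b
table.write("a", 2, "z")
table.write("c", 1, "w")  # flush: SSTable 1 holds partitions a and c
print(table.read("a"))
```

Partition "a" ends up in both SSTables, and each SSTable holds more than one partition, which is why a read has to merge across files until compaction combines them.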
This seems to introduce complexity to the system and make it hard to debug if the compaction processes go wild.
Compaction itself is not a trivial topic, but since the initial premise is not correct, it has not introduced this.
It seems the partitioning granularity is fixed beforehand. Is there any way to further partitioning after the initial stage?
Writing specifically about Cassandra - every time you add or remove a node the token ranges that belong to each node can and will alter. So it is not entirely 'static', but it is not easy to change or manipulate either.
For example, if a-c is one partition, later on prefix b is hot and becomes a noisy neighbor to prefix a and c, is there a way to isolate b to another node?
Again - specific to Cassandra, in theory yes - you calculate the hash value of the partition key, and use initial_token values on a node to give it a very narrow range. In practice no - this is a data model design issue: the data is partitioned in a way that has created a hot spot.
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1. Am I wrong?
Using num_tokens, which creates vnodes, in effect divides the consistent hash ring up more times, so 10 nodes with num_tokens = 16 means that the overall token range is divided into 160 slices, with each node having 16 of them as its primary ranges. Each node will of course also hold replicas of other nodes' ranges, based on replication factor and rack assignments. If you only had RF=1, then each node would only be storing data for the range(s) it is assigned.
initial_token is the setting that controls the token value(s) when the node is bootstrapped - you can choose to calculate it and set it manually, or you can let Cassandra calculate it for you. Further changes to that setting after bootstrap will not have an impact.
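The vnode arithmetic can be checked with a small sketch. build_ring is a hypothetical helper that picks tokens at random, which only approximates how Cassandra allocates them.

```python
import random
from collections import Counter

def build_ring(nodes, num_tokens, seed=0):
    """Give every node `num_tokens` random tokens and return the sorted ring."""
    rng = random.Random(seed)
    ring = []
    for node in nodes:
        for _ in range(num_tokens):
            ring.append((rng.randrange(-2**63, 2**63), node))
    ring.sort()
    return ring

nodes = [f"node-{i}" for i in range(10)]
ring = build_ring(nodes, num_tokens=16)
slices_per_node = Counter(node for _, node in ring)

print(len(ring))                   # total number of slices on the ring
print(slices_per_node["node-0"])   # slices whose primary owner is node-0
```

The slices are scattered around the ring instead of being contiguous per node, so when a node joins or leaves, the affected ranges are spread over many peers.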

Related

Replication without partitioning in Cassandra

In Mongo we can go for any of the models below:
1) Simple replication (without shards, where one node works as master and the others as slaves)
2) Sharding (where data is distributed across different shards based on a partition key)
3) Both 1 and 2
My question: can't we have Cassandra with just replication and no partitioning, like model 1 in Mongo?
From Cassandra vs MongoDB in respect of Secondary Index?
In case of Cassandra, the data is distributed into multiple nodes based on the partition key.
From the above it looks like it is mandatory to distribute the data based on some partition key when we have more than one node?
In Cassandra, the replication factor defines how many copies of the data you have. The partition key is responsible for distributing data between nodes. But this distribution depends on the number of nodes that you have. For example, if you have a 3-node cluster and a replication factor equal to 3, then all nodes will get the data anyway...
Basically your intuition is right: the data is always distributed based on the partition key. The partition key (sometimes called the row key) is the first part of the primary key, so you always have one. Case 1 of your Mongo example is not doable in Cassandra, mainly because Cassandra has no concept of masters and slaves. If you have a 2-node cluster and a replication factor of 2, then the data will be held on both nodes, as Alex Ott already pointed out. When you query (read or write), your client decides which node to connect to and performs the operation. To my knowledge, the default here is round-robin load balancing between the two nodes, so each of them will receive roughly the same load. If you have 3 nodes and a replication factor of 2, it becomes a little more tricky. The nice part, though, is that you can determine the set of nodes which hold your data in the client code, so you don't lose any performance by connecting to a "wrong" node.
One more thing about partitions: you can configure some of this, but this would be per server and not per table. I've never used this, and personally I wouldn't recommend doing so. Just stick to the default mechanism of Cassandra.
And one word about the secondary index thing: use materialized views.
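How a client could work out which nodes hold a key can be sketched as follows. This assumes a simple "walk the ring clockwise" placement similar in spirit to SimpleStrategy; the ring, tokens, and node names are invented, and real drivers get this information from cluster metadata.

```python
import bisect

def replicas(ring, token, rf):
    """Walk the ring clockwise from the token's owner, collecting `rf` distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token)
    result = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in result:
            result.append(node)
        if len(result) == rf:
            break
    return result

# Hypothetical three-node ring with small tokens for readability.
ring = [(0, "n1"), (100, "n2"), (200, "n3")]

print(replicas(ring, 50, rf=2))   # two of the three nodes hold this token
print(replicas(ring, 50, rf=3))   # with RF equal to cluster size, every node does
```

With RF equal to the number of nodes, every node is a replica for every key, yet the data is still partitioned: the first node in each list differs depending on the token.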

Ordered Partitioning vs Random Partitioning

According to most articles on the internet, Random Partitioning (RP) is better than Ordered Partitioning (OP) because of the data distribution.
In fact, I think that because of data replication, the data will be well distributed even if we are using OP. So is the first assumption still true?
What about read performance? Is OP better than RP when trying to read data between two values in the same range?
Thanks a lot
I can't really answer confidently for HBase (which only supports Ordered Partitioning to my knowledge), but for Cassandra I would strongly discourage the use of OrderPreservingPartitioner and ByteOrderedPartitioner unless you have a very specific use case that requires it (like if you need to do range scans across keys). It is not very common for an Ordered Partitioner to be used.
In fact, I think that because of data replication, the data will be well distributed even if we are using OP. So is the first assumption still true?
Not particularly; hotspots are much more likely to be encountered with an Ordered Partitioner than with a Random Partitioner. As described on the Partitioners page of the Cassandra Wiki:
globally ordering all your partitions generates hot spots: some partitions close together will get more activity than others, and the node hosting those will be overloaded relative to others. You can try to mitigate with active load balancing but this works poorly in practice; by the time you can adjust token assignments so that less hot partitions are on the overloaded node, your workload often changes enough that the hot spot is now elsewhere. Remember that preserving global order means you can't just pick and choose hot partitions to relocate, you have to relocate contiguous ranges.
There are other problems with Ordered Partitioning that are described well here:
Difficult load balancing:
More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges based on their estimates of the partition key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Uneven load balancing for multiple tables:
If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
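The hot-spot problem described above can be seen in a small simulation. Everything here is a simplification for illustration: four nodes, five-digit numeric keys, contiguous key ranges for the ordered case, and MD5 modulo the node count as a stand-in for a hashing partitioner.

```python
import hashlib
from collections import Counter

NODES = 4

def ordered_node(key):
    """Ordered placement: split the key space 00000..99999 into contiguous ranges."""
    return int(key) * NODES // 100_000

def hashed_node(key):
    """Hashed placement: MD5 as a stand-in for the partitioner's hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES

# A burst of sequential keys, e.g. timestamps or auto-incrementing ids.
burst = [f"{i:05d}" for i in range(20_000, 21_000)]

ordered_load = Counter(ordered_node(k) for k in burst)
hashed_load = Counter(hashed_node(k) for k in burst)

print("ordered:", dict(ordered_load))
print("hashed: ", dict(hashed_load))
```

Under ordered placement the whole burst lands on one node, while hashing spreads it over all four. The price is that neighbouring keys no longer sit next to each other, which is what makes range scans cheap with an ordered partitioner.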
With regards to:
What about read performance? Is OP better than RP when trying to read data between two values in the same range?
You will definitely achieve better performance for range scans (i.e. get all data between this key and that key).
So it really comes down to the kind of queries you are making. Are range scan queries between keys vital to you? In that case HBase may be a more appropriate solution for you. If it is not as important, there are reasons to consider C* instead. I won't add much more to that as I don't want my answer to devolve into comparing the two solutions :).

How does Cassandra partitioning work when replication factor == cluster size?

Background:
I'm new to Cassandra and still trying to wrap my mind around the internal workings.
I'm thinking of using Cassandra in an application that will only ever have a limited number of nodes (less than 10, most commonly 3). Ideally each node in my cluster would have a complete copy of all of the application data. So, I'm considering setting replication factor to cluster size. When additional nodes are added, I would alter the keyspace to increment the replication factor setting (nodetool repair to ensure that it gets the necessary data).
I would be using the NetworkTopologyStrategy for replication to take advantage of knowledge about datacenters.
In this situation, how does partitioning actually work? I've read about a combination of nodes and partition keys forming a ring in Cassandra. If all of my nodes are "responsible" for each piece of data regardless of the hash value calculated by the partitioner, do I just have a ring of one partition key?
Are there tremendous downfalls to this type of Cassandra deployment? I'm guessing there would be lots of asynchronous replication going on in the background as data was propagated to every node, but this is one of the design goals so I'm okay with it.
The consistency level on reads would probably generally be "one" or "local_one".
The consistency level on writes would generally be "two".
Actual questions to answer:
Is replication factor == cluster size a common (or even a reasonable) deployment strategy aside from the obvious case of a cluster of one?
Do I actually have a ring of one partition where all possible values generated by the partitioner go to the one partition?
Is each node considered "responsible" for every row of data?
If I were to use a write consistency of "one" does Cassandra always write the data to the node contacted by the client?
Are there other downfalls to this strategy that I don't know about?
Do I actually have a ring of one partition where all possible values generated by the partitioner go to the one partition?
Is each node considered "responsible" for every row of data?
If all of my nodes are "responsible" for each piece of data regardless of the hash value calculated by the partitioner, do I just have a ring of one partition key?
Not exactly. C* nodes still have token ranges, and C* still assigns a primary replica to the "responsible" node. But with RF = N (where N is the number of nodes), all nodes will also have a replica. So in essence the implication is the same as what you described.
Are there tremendous downfalls to this type of Cassandra deployment?
Are there other downfalls to this strategy that I don't know about?
Not that I can think of, I guess you might be more susceptible than average to inconsistent data so use C*'s anti-entropy mechanisms to counter this (repair, read repair, hinted handoff).
Consistency level quorum or all would start to get expensive but I see you don't intend to use them.
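The cost of QUORUM and ALL as the replication factor grows follows from simple arithmetic, sketched below. The helper names are made up; the formulas are the usual quorum size, floor(RF / 2) + 1, and the overlap rule R + W > RF.

```python
def quorum(rf):
    """Replicas that must acknowledge a QUORUM request: floor(RF / 2) + 1."""
    return rf // 2 + 1

def read_sees_latest_write(read_cl, write_cl, rf):
    """True when every read replica set overlaps every write replica set (R + W > RF)."""
    return read_cl + write_cl > rf

for rf in (3, 5, 9):
    print(f"RF={rf}: QUORUM waits for {quorum(rf)} replicas, ALL waits for {rf}")

# The plan from the question: read at ONE, write at TWO, RF = 3.
print(read_sees_latest_write(read_cl=1, write_cl=2, rf=3))
```

With reads at ONE and writes at TWO on RF = 3, the overlap rule does not hold, so a read can land on the one replica that has not yet received a write. That is the inconsistency the anti-entropy mechanisms mentioned above are meant to repair.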
Is replication factor == cluster size a common (or even a reasonable) deployment strategy aside from the obvious case of a cluster of one?
It's not common. I guess you are looking for super high availability and all your data fits on one box. I don't think I've ever seen a C* deployment with RF > 5. By far the most common is RF = 3.
If I were to use a write consistency of "one" does Cassandra always write the data to the node contacted by the client?
This depends on your load balancing policies at the driver. Often we select token aware policies (assuming you're using one of the DataStax drivers), in which case requests are routed to the primary replica automatically. You could use round robin in your case and have the same effect.
The primary downfall will be increased write costs at the coordinator level as you add nodes. The maximum number of replicas written to I've seen is around 8 (5 for other data centers and 3 for local replicas).
In practice this will mean reduced stability while performing large or batched writes (greater than 1 MB) or a lower per-node write TPS.
The primary advantage is that you can do a lot of things that would normally be awful or impossible to do. Want to use secondary indexes? They will probably work reasonably well (assuming cardinality and partition size don't become your bottleneck there). Want to add a custom UDF that does GroupBy, or use very large IN queries? It'll probably work.
It is, as @Phact mentions, not a common usage pattern, and I primarily saw it used with DSE Search on low write throughput use cases that had requirements for 'single node' features from Solr. But for those same use cases with pure Cassandra you'd get some benefits on the read side and be able to do expensive queries that are normally impossible in a more distributed cluster.

Change replication factor of selected objects

Is there any cloud storage system (e.g. Cassandra, Hazelcast, OpenStack Swift) where we can change the replication factor of selected objects? For instance, let's say we have found hotspot objects in the system; can we increase their replication factor as a solution?
Thanks
In Cassandra the replication factor is controlled based on keyspaces. So you first define a keyspace by specifying the replication factor the keyspace should have in each of your data centers. Then within a keyspace, you create database tables, and those tables are replicated according to the keyspace they are defined in. Objects are then stored in rows in a table using a primary key.
You can change the replication factor for a keyspace at any time by using the "alter keyspace" CQL command. To update the cluster to use the new replication factor, you would then run "nodetool repair" for each node (most installations run this periodically anyway for anti-entropy).
Then if you use, for example, the Cassandra Java driver, you can specify the load balancing policy to use when accessing the cluster, such as round robin or a token aware policy. So if you have multiple replicas of the table holding the objects, then the load of accessing the object could be set to round robin on just the nodes that have a copy of the row you are accessing. If you are using a read consistency level of ONE, then this would spread out the read load.
So the granularity of this is not at the object level, but at the keyspace level. If you had all your objects stored in one table, then changing the replication factor would change it for all objects in that table (and in every other table in the same keyspace), not just one. You could have multiple keyspaces with different replication factors and keep high demand objects in a keyspace with a high RF, and less frequently accessed objects in a keyspace with a low RF.
Another way you could reduce the hot spot for an object in Cassandra is to make additional copies of it by inserting it into additional rows of a table. The rows are accessed on nodes by the compound partition key, so one field of the partition key could be a "copy_number" value, and when you go to read the object, you randomly set a copy_number value (from 0 to the number of copy rows you have) so that the load of reading the object will likely hit a different node for each read (since rows are hashed across the cluster based on the partition key). This approach would give you more granularity at the object level compared to changing the replication factor for the whole table, at the cost of more programming work to manage randomly reading different rows.
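The copy_number idea might look roughly like this on the client side. The helper names are hypothetical, and toy_node uses MD5 modulo the node count as a stand-in for the real partitioner, so the exact placement is illustrative only.

```python
import hashlib
import random

def toy_node(partition_key, num_nodes):
    """Stand-in for the partitioner: hash the compound key onto a node index."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def write_keys(object_id, copies):
    """Compound partition keys (object_id, copy_number) to write the object under."""
    return [(object_id, n) for n in range(copies)]

def read_key(object_id, copies, rng=random):
    """Pick one copy at random so reads of a hot object spread across partitions."""
    return (object_id, rng.randrange(copies))

keys = write_keys("hot-object", copies=16)
nodes_touched = {toy_node(k, num_nodes=4) for k in keys}
print(sorted(nodes_touched))
print(read_key("hot-object", copies=16))
```

The reader has to know how many copies exist, and every update has to be written to all of them, which is the extra programming work mentioned above.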
In Infinispan, you can also set number of owners (replicas) on each cache (equivalent to Hazelcast's map or Cassandra's table), but not for one specific entry. Since the routing information (aka consistent hash table) does not contain all keys but splits the hashCode() 32-bit range to variable amount of segments, and then specifies the distribution only for these segments, there's no way to specify the number of replicas per entry.
Theoretically, with specially forged keys and custom consistent hash table factory, you could achieve something similar even in one cache (certain sorts of keys would be replicated different amount of times), but that would require coding with deep understanding of the system.
Anyway, the reader would have to know the number of replicas in advance as this would be part of the routing information (cache in simple case, special keys as described above), therefore, it's not really practical unless the reader can know that.
I guess you want to use the replication factor for the sake of speeding up reads.
The regular Map (IMap) implementation, uses a master slave(s) setup, so all reads will go through the master. But there is a special setting available, so you are also allowed to read from backups. So if you have a 10 node cluster, and have a backup count of 5, there will be in total 6 members that have the information stored. 5 members in the cluster will hit the master, and 5 members in the cluster will hit the backup (since they have the backup locally available).
There is also a fully replicated map available, where every item is sent to every machine. So in a 10 node cluster, all reads will be local since every machine has the same data.
In case of the IMap, we don't provide control on the number of backups on the key/value level. So the whole map is configured with a certain backup-count.

Problems In Cassandra ByteOrderedPartitioner in Cluster Environment

I am using Cassandra 1.2.15 with ByteOrderedPartitioner in a cluster environment of 4 nodes with 2 replicas. What are the drawbacks of using this partitioner in a cluster environment? After a long search I found one drawback, and I need to know its consequences:
1) Data will not be distributed evenly.
What type of problem will occur if data is not distributed evenly?
Are there any other drawbacks to this partitioner in a cluster environment? If so, what are their consequences? Please explain clearly.
One more question: suppose I go with Murmur3Partitioner, so the data is distributed evenly. The order will not be preserved, but this drawback can be overcome with cluster ordering (the second key in the primary key). Is my understanding correct?
As you are using Cassandra 1.2.15, I have found a doc pertaining to Cassandra 1.2 which illustrates the points behind why using the ByteOrderedPartitioner (BOP) is a bad idea:
http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePartitionerBOP_c.html
Difficult load balancing:
More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges (formerly token ranges) based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Sequential writes can cause hot spots:
If your application tends to write or update a sequential block of rows at a time, then the writes are not distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data.
Uneven load balancing for multiple tables:
If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
For these reasons, the BOP has been identified as a Cassandra anti-pattern. Matt Dennis has a slideshare presentation on Cassandra Anti-Patterns, and his slide about the BOP looks like this:
So seriously, do not use the BOP.
"however this drawback can be overcome with cluster ordering (Second key in the primary keys). Whether my understanding is correct?"
Somewhat, yes. In Cassandra you can dictate the order of your rows (within a partition key) by using a clustering key. If you wanted to keep track of (for example) station-based weather data, your table definition might look something like this:
CREATE TABLE stationreads (
    stationid uuid,
    readingdatetime timestamp,
    temperature double,
    windspeed double,
    PRIMARY KEY ((stationid), readingdatetime)
);
With this structure, you could query all of the readings for a particular weather station, and order them by readingdatetime. However, if you queried all of the data (ex: SELECT * FROM stationreads;) the results probably will not be in any discernible order. That's because the total result set will be ordered by the (random) hashed values of the partition key (stationid in this case). So while "yes" you can order your results in Cassandra, you can only do so within the context of a particular partition key.
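The difference between ordering within a partition and ordering of a full scan can be mimicked in a few lines. The station names and readings are invented sample data, and toy_token uses MD5 as a stand-in for the Murmur3 hash.

```python
import hashlib

def toy_token(partition_key):
    """Stand-in for the partitioner hash (NOT real Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# (stationid, readingdatetime, temperature): invented sample rows.
rows = [
    ("station-1", "2014-06-02T00:00", 18.5),
    ("station-2", "2014-06-01T00:00", 21.0),
    ("station-1", "2014-06-01T00:00", 17.0),
    ("station-3", "2014-06-01T00:00", 15.5),
    ("station-2", "2014-06-02T00:00", 22.5),
]

def read_partition(stationid):
    """Like SELECT ... WHERE stationid = ?: rows come back in clustering key order."""
    return sorted((r for r in rows if r[0] == stationid), key=lambda r: r[1])

def full_scan():
    """Like SELECT * FROM stationreads: partitions in token order, rows ordered inside each."""
    return sorted(rows, key=lambda r: (toy_token(r[0]), r[1]))

print(read_partition("station-1"))
for row in full_scan():
    print(row)
```

In the full scan each station's readings stay together and are sorted by time, but the order in which the stations themselves appear depends only on the hash of stationid.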
Also, there have been many improvements in Cassandra since 1.2.15. You should definitely consider using a more recent (2.x) version.
