Problems with Cassandra ByteOrderedPartitioner in a Cluster Environment

I am using Cassandra 1.2.15 with ByteOrderedPartitioner in a cluster of 4 nodes with 2 replicas. What are the drawbacks of using this partitioner in a cluster environment? After a long search I found one drawback, and I would like to know its consequences:
1) Data will not be distributed evenly.
What kinds of problems occur if data is not distributed evenly?
Are there any other drawbacks to this partitioner in a cluster environment? If so, what are their consequences? Please explain clearly.
One more question: suppose I go with Murmur3Partitioner, so the data is distributed evenly but the order is not preserved. Can this drawback be overcome with clustering order (the second key in the primary key)? Is my understanding correct?

As you are using Cassandra 1.2.15, I have found a doc pertaining to Cassandra 1.2 which illustrates the points behind why using the ByteOrderedPartitioner (BOP) is a bad idea:
http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePartitionerBOP_c.html
Difficult load balancing:
More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges (formerly token ranges) based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Sequential writes can cause hot spots:
If your application tends to write or update a sequential block of rows at a time, then the writes are not distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data.
Uneven load balancing for multiple tables:
If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
For these reasons, the BOP has been identified as a Cassandra anti-pattern. Matt Dennis has a SlideShare presentation on Cassandra Anti-Patterns that includes a slide about the BOP.
So seriously, do not use the BOP.
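To make the hot-spot point concrete, here is a minimal Python sketch (not Cassandra code) that places 1000 sequential, timestamp-like keys on a hypothetical 4-node ring, once with byte-ordered tokens and once with hashed tokens. The node names, token values, and keys are made up for illustration, and MD5 is only a stand-in for the partitioner's real hash function.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical 4-node cluster, one token per node (no vnodes).
NODES = ["node1", "node2", "node3", "node4"]

# Ordered partitioner: tokens are raw key bytes, here splitting the
# key space evenly by first byte (an administrator's initial guess).
ORDERED_TOKENS = [b"\x40", b"\x80", b"\xc0", b"\xff"]

# Hashed partitioner: tokens split a 64-bit hash space into equal ranges.
HASHED_TOKENS = [(i + 1) * 2**64 // len(NODES) - 1 for i in range(len(NODES))]


def owner(token, ring_tokens):
    """Return the node owning `token`: the first node whose token is >= it,
    wrapping around to the first node past the end of the ring."""
    return NODES[bisect.bisect_left(ring_tokens, token) % len(NODES)]


def ordered_owner(key: bytes) -> str:
    # The key itself is the token, so key order equals ring order.
    return owner(key, ORDERED_TOKENS)


def hashed_owner(key: bytes) -> str:
    # MD5 stands in for the real hash function (an assumption of this sketch).
    token = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return owner(token, HASHED_TOKENS)


# Sequential, timestamp-like row keys (made up for illustration).
KEYS = [f"2014-06-01T00:00:00.{i:03d}".encode() for i in range(1000)]

ordered_load = Counter(ordered_owner(k) for k in KEYS)
hashed_load = Counter(hashed_owner(k) for k in KEYS)

print("ordered:", dict(ordered_load))  # every key lands on node1
print("hashed: ", dict(hashed_load))   # keys spread over all four nodes
```

With ordered tokens, the only way to rebalance is to move the token boundaries by hand, which is the administrative overhead the quoted documentation describes.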
"however this drawback can be overcome with cluster ordering (Second key in the primary keys). Whether my understanding is correct?"
Somewhat, yes. In Cassandra you can dictate the order of your rows (within a partition key) by using a clustering key. If you wanted to keep track of (for example) station-based weather data, your table definition might look something like this:
CREATE TABLE stationreads (
    stationid uuid,
    readingdatetime timestamp,
    temperature double,
    windspeed double,
    PRIMARY KEY ((stationid), readingdatetime));
With this structure, you could query all of the readings for a particular weather station, and order them by readingdatetime. However, if you queried all of the data (ex: SELECT * FROM stationreads;) the results probably will not be in any discernible order. That's because the total result set will be ordered by the (random) hashed values of the partition key (stationid in this case). So while "yes" you can order your results in Cassandra, you can only do so within the context of a particular partition key.
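The difference between ordering within a partition and ordering across partitions can be sketched with a toy in-memory model in Python. This is only an illustration of the idea, not how Cassandra stores data: the `StationReads` class, the string station ids, and the MD5-based `token` function are all assumptions of the sketch.

```python
import hashlib
from collections import defaultdict


def token(partition_key: str) -> int:
    """Hash a partition key to a token (MD5 is a stand-in hash)."""
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")


class StationReads:
    """Toy model of the stationreads table above."""

    def __init__(self):
        # stationid -> {readingdatetime: temperature}
        self._partitions = defaultdict(dict)

    def insert(self, stationid, readingdatetime, temperature):
        self._partitions[stationid][readingdatetime] = temperature

    def select_station(self, stationid):
        # Within one partition, rows come back in clustering-key order.
        rows = self._partitions.get(stationid, {})
        return [(stationid, ts, rows[ts]) for ts in sorted(rows)]

    def select_all(self):
        # Across partitions, rows come back in token order, which has
        # no meaningful relation to the station ids themselves.
        result = []
        for stationid in sorted(self._partitions, key=token):
            result.extend(self.select_station(stationid))
        return result


table = StationReads()
table.insert("station-b", "2014-06-01T10:00", 21.5)
table.insert("station-a", "2014-06-01T12:00", 19.0)
table.insert("station-a", "2014-06-01T09:00", 17.0)

print(table.select_station("station-a"))  # ordered by readingdatetime
print(table.select_all())                 # partitions ordered by token
```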
Also, there have been many improvements in Cassandra since 1.2.15. You should definitely consider using a more recent (2.x) version.

Related

Client data isolation: can Cassandra store data in different partitions in separate file sets?

Suppose I have a Cassandra table with an integer partition key.
Question: is it possible to arrange for Cassandra to store the table data and indexes for that table in separate sets of files, one per partition value? Alternative approaches, like per-partition keyspaces or duplicated tables (Account1 for partition key 1, Account2 for partition key 2), are deemed to undercut Cassandra performance.
The desired outcome is to reduce the possibility that selecting sensitive client data for partition 1 also returns data from other partitions. If the data is kept separate (and searched separately), this risk is reduced, though obviously not eliminated. Essentially, it shifts some of the responsibility for using the right partition key at the right time from the application code onto Cassandra.
It's not possible in Cassandra itself unless you separate the data into tables or keyspaces, but, as you mentioned, that will lead to bad performance.
DataStax Enterprise (DSE) has functionality called Row Level Access Control that allows you to set permissions based on the value of partition key (or part of partition key).
If you need to stick to plain Cassandra, then you need to do it on the application level.

Replication without partitioning in Cassandra

In Mongo we can choose any of the models below:
1. Simple replication (without sharding, where one node works as the master and the others as slaves)
2. Sharding (where data is distributed across different shards based on the partition key)
3. Both 1 and 2
My question: can't we run Cassandra with replication only, without partitioning, just like model 1 in Mongo?
From "Cassandra vs MongoDB in respect of Secondary Index?":
In case of Cassandra, the data is distributed into multiple nodes based on the partition key.
From the above, it looks like it is mandatory to distribute the data based on some partition key when we have more than one node?
In Cassandra, the replication factor defines how many copies of the data you have. The partition key is responsible for distributing data between nodes. But this distribution depends on the number of nodes that you have. For example, if you have a 3-node cluster and a replication factor of 3, then all nodes will get the data anyway...
Basically your intuition is right: the data is always distributed based on the partition key. The partition key (also called the row key) is part of every table's primary key, so you have one anyway. Case 1 of your Mongo example is not doable in Cassandra, mainly because Cassandra has no concept of masters and slaves. If you have a 2-node cluster and a replication factor of 2, then the data will be held on both nodes, as Alex Ott already pointed out. When you query (read or write), your client decides which node to connect to and performs the operation. To my knowledge, the default here is round-robin load balancing between the two nodes, so each of them receives roughly the same load. If you have 3 nodes and a replication factor of 2, it becomes a little more tricky. The nice part, though, is that the client code can determine the set of nodes that hold your data, so you don't lose any performance by connecting to a "wrong" node.
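How the replication factor and the node count interact can be sketched in a few lines of Python. This is a simplified model under stated assumptions: one evenly spaced token per node (no vnodes), a single data center, MD5 as a stand-in hash, and replicas placed on the next nodes clockwise around the ring. The function and node names are hypothetical.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Hash a partition key onto a 64-bit ring (MD5 is a stand-in hash)."""
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")


def make_ring(nodes):
    """Give each node one evenly spaced token (no vnodes, for simplicity)."""
    n = len(nodes)
    return [((i + 1) * 2**64 // n - 1, node) for i, node in enumerate(nodes)]


def replicas(partition_key, ring, rf):
    """Primary replica by token, then the next rf-1 nodes clockwise."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(min(rf, len(ring)))]


three_nodes = make_ring(["node1", "node2", "node3"])
for key in ["client-1", "client-2", "client-3"]:
    print(key,
          "RF=3:", replicas(key, three_nodes, 3),
          "RF=2:", replicas(key, three_nodes, 2))
```

With RF equal to the node count, every key maps to every node, which is the closest Cassandra gets to "replication without partitioning". With RF=2 on 3 nodes, each key lives on a different pair of nodes, and a client that computes the same token can contact one of those two directly.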
One more thing about partitions: you can configure some of this, but it would be per server and not per table. I've never used this, and personally I wouldn't recommend doing so. Just stick to the default mechanism of Cassandra.
And one word about the secondary index thing: use materialized views.

Ordered Partitioning vs Random Partitioning

According to most articles on the internet, Random Partitioning (RP) is better than Ordered Partitioning (OP) because of the data distribution.
In fact, I think that because of data replication, even with OP the data will be well distributed! So is the first assumption still true?
What about read performance? Is OP better than RP when trying to read data between two values in the same range?
Thanks a lot
I can't really answer confidently for HBase (which only supports Ordered Partitioning, to my knowledge), but for Cassandra I would strongly discourage the use of OrderPreservingPartitioner and ByteOrderedPartitioner unless you have a very specific use case that requires it (like if you need to do range scans across keys). It is not very common for an Ordered Partitioner to be used.
"In fact, I think that because of data replication, even with OP the data will be well distributed! So is the first assumption still true?"
Not particularly. Hot spots are much more likely to be encountered with an Ordered Partitioner than with a Random Partitioner. As described on the Partitioners page of the Cassandra Wiki:
globally ordering all your partitions generates hot spots: some partitions close together will get more activity than others, and the node hosting those will be overloaded relative to others. You can try to mitigate with active load balancing but this works poorly in practice; by the time you can adjust token assignments so that less hot partitions are on the overloaded node, your workload often changes enough that the hot spot is now elsewhere. Remember that preserving global order means you can't just pick and choose hot partitions to relocate, you have to relocate contiguous ranges.
There are other problems with Ordered Partitioning that are described well here:
Difficult load balancing:
More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges based on their estimates of the partition key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Uneven load balancing for multiple tables:
If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
With regard to:
"What about read performance? Is OP better than RP when trying to read data between two values in the same range?"
You will definitely achieve better performance for range scans (i.e. get all data between this key and that key).
So it really comes down to the kind of queries you are making. Are range scan queries between keys vital to you? In that case HBase may be a more appropriate solution for you. If it is not as important, there are reasons to consider C* instead. I won't add much more to that as I don't want my answer to devolve into comparing the two solutions :).

Cassandra: Controlling which node receives data

My understanding of Cassandra's recommended clustering approach is to ensure that each node in the cluster receives an equal distribution of data, by hashing a document's unique Id. My question is if there is a way to change this and define a custom key for "intelligently" routing a document to a specific node in the cluster?
In my scenario, I have data which relates to a specific entity (think client-project-task-item). Across all my data, I will have enough items to require some horizontal scaling; however, each search will always relate to a given client-project-task, for which the data set is only of moderate size.
Is there a way to create this type of partitioning / routing (different names I've seen for the same thing) logic in Cassandra?
Thanks; Brent
The clustering approach in Cassandra is not just about an equal distribution of data. It also ensures that all read/write operations are distributed across the cluster, which makes these operations faster. In addition, you will most likely have a replication factor greater than 1 to ensure data redundancy, so that a node failure does not result in data loss.
Back to your question and to your own answer. If you use the same partition key for the data, Cassandra's partitioning guarantees that the primary replica of that data is stored on the same node and, even more, in the same partition (a "wide row" in the old naming).
I think - http://www.datastax.com/documentation/cql/3.0/share/glossary/gloss_partition_key.html - is the answer I'm looking for
The first column declared in the PRIMARY KEY definition, or in the case of a compound key, multiple columns can declare those columns that form the primary key.

Change replication factor of selected objects

Is there any cloud storage system (e.g. Cassandra, Hazelcast, OpenStack Swift) where we can change the replication factor of selected objects? For instance, let's say we have found hotspot objects in the system; can we increase their replication factor as a solution?
Thanks
In Cassandra the replication factor is controlled based on keyspaces. So you first define a keyspace by specifying the replication factor the keyspace should have in each of your data centers. Then within a keyspace, you create database tables, and those tables are replicated according to the keyspace they are defined in. Objects are then stored in rows in a table using a primary key.
You can change the replication factor for a keyspace at any time by using the "alter keyspace" CQL command. To update the cluster to use the new replication factor, you would then run "nodetool repair" for each node (most installations run this periodically anyway for anti-entropy).
Then, if you use, for example, the Cassandra Java driver, you can specify the load balancing policy to use when accessing the cluster, such as round robin or a token-aware policy. So if you have multiple replicas of the table holding the objects, the load of accessing an object can be spread round robin over just the nodes that have a copy of the row you are accessing. If you are using a read consistency level of ONE, this spreads out the read load.
So the granularity of this is not at the object level, but at the keyspace level (and therefore covers every table in that keyspace). If you had all your objects stored in one table, then changing the replication factor of its keyspace would change it for all objects in that table and not just one. You could have multiple keyspaces with different replication factors and keep high-demand objects in a keyspace with a high RF, and less frequently accessed objects in a keyspace with a low RF.
Another way you could reduce the hot spot for an object in Cassandra is to make additional copies of it by inserting it into additional rows of a table. The rows are accessed on nodes by the compound partition key, so one field of the partition key could be a "copy_number" value, and when you go to read the object, you randomly set a copy_number value (from 0 to the number of copy rows you have) so that the load of reading the object will likely hit a different node for each read (since rows are hashed across the cluster based on the partition key). This approach would give you more granularity at the object level compared to changing the replication factor for the whole table, at the cost of more programming work to manage randomly reading different rows.
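The copy-row idea can be sketched as follows. This is a toy in-memory Python model, not driver code: the `CopiedObjectStore` class and its method names are hypothetical, and a plain dict stands in for a table whose partition key is the pair (object_id, copy_number).

```python
import random
from collections import Counter


class CopiedObjectStore:
    """Toy table keyed by the compound partition key (object_id, copy_number)."""

    def __init__(self, copies: int):
        self.copies = copies
        self.rows = {}

    def write(self, object_id: str, value: str) -> None:
        # Insert the same value once per copy row.
        for copy_number in range(self.copies):
            self.rows[(object_id, copy_number)] = value

    def read(self, object_id: str, rng=random):
        # Pick one copy at random so reads spread over the copy rows.
        copy_number = rng.randrange(self.copies)
        return copy_number, self.rows[(object_id, copy_number)]


store = CopiedObjectStore(copies=4)
store.write("hot-object", "payload")

rng = random.Random(0)  # fixed seed only to make the demo repeatable
hits = Counter(store.read("hot-object", rng)[0] for _ in range(1000))
print(dict(hits))  # reads per copy_number
```

Because each (object_id, copy_number) pair hashes to its own token, the copies will usually land on different nodes in a real cluster. The cost is that every update has to rewrite all copy rows, and the reader has to know how many copies exist.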
In Infinispan, you can also set the number of owners (replicas) on each cache (equivalent to Hazelcast's map or Cassandra's table), but not for one specific entry. Since the routing information (aka the consistent hash table) does not contain all keys but splits the 32-bit hashCode() range into a variable number of segments, and then specifies the distribution only for these segments, there's no way to specify the number of replicas per entry.
Theoretically, with specially forged keys and a custom consistent hash table factory, you could achieve something similar even in one cache (certain sorts of keys would be replicated a different number of times), but that would require coding with a deep understanding of the system.
Anyway, the reader would have to know the number of replicas in advance as this would be part of the routing information (cache in simple case, special keys as described above), therefore, it's not really practical unless the reader can know that.
I guess you want to use the replication factor for the sake of speeding up reads.
The regular Map (IMap) implementation uses a master/slave(s) setup, so all reads go through the master. But there is a special setting available that also allows reads from backups. So if you have a 10-node cluster and a backup count of 5, there will be 6 members in total that have the information stored. 5 members in the cluster will hit the master, and 5 members in the cluster will hit the backup (since they have the backup locally available).
There is also a fully replicated map available, where every item is sent to every machine. So in a 10-node cluster, all reads will be local, since every machine has the same data.
In case of the IMap, we don't provide control on the number of backups on the key/value level. So the whole map is configured with a certain backup-count.

Resources